Workshop on Multiword Expressions (2018)

Volumes

Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018) 34 papers

pdf (full)
bib (full) Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

pdf bib abs
Annotation Schemes for Surface Construction Labeling
Lori Levin

In this talk I will describe the interaction of linguistics and language technologies in Surface Construction Labeling (SCL) from the perspective of corpus annotation tasks such as definiteness, modality, and causality. Linguistically, following Construction Grammar, SCL recognizes that meaning may be carried by morphemes, words, or arbitrary constellations of morpho-lexical elements. SCL is like Shallow Semantic Parsing in that it does not attempt a full compositional analysis of meaning, but rather identifies only the main elements of a semantic frame, where the frames may be invoked by constructions as well as lexical items. Computationally, SCL is different from tasks such as information extraction in that it deals only with meanings that are expressed in a conventional, grammaticalized way and does not address inferred meanings. I review the work of Dunietz (2018) on the labeling of causal frames including causal connectives and cause and effect arguments. I will describe how to design an annotation scheme for SCL, including isolating basic units of form and meaning and building a “constructicon”. I will conclude with remarks about the nature of universal categories and universal meaning representations in language technologies. This talk describes joint work with Jaime Carbonell, Jesse Dunietz, Nathan Schneider, and Miriam Petruck.

pdf bib abs
From Lexical Functional Grammar to Enhanced Universal Dependencies
Adam Przepiórkowski | Agnieszka Patejuk

This is a summary of an invited talk.

pdf bib abs
Leaving no token behind: comprehensive (and delicious) annotation of MWEs and supersenses
Nathan Schneider

I will describe an unorthodox approach to lexical semantic annotation that prioritizes corpus coverage, democratizing analysis of a wide range of expression types. I argue that a lexicon-free lexical semantics—defined in terms of units and supersense tags—is an appetizing direction for NLP, as it is robust, cost-effective, easily understood, not too language-specific, and can serve as a foundation for richer semantic structure. Linguistic delicacies from the STREUSLE and DiMSUM corpora, which have been multiword- and supersense-annotated, attest to the veritable smörgåsbord of noncanonical constructions in English, including various flavors of prepositions, MWEs, and other curiosities. Bio: Nathan Schneider is an annotation schemer and computational modeler for natural language. As Assistant Professor of Linguistics and Computer Science at Georgetown University, he looks for synergies between practical language technologies and the scientific study of language. He specializes in broad-coverage semantic analysis: designing linguistic meaning representations, annotating them in corpora, and automating them with statistical natural language processing techniques. A central focus in this research is the nexus between grammar and lexicon as manifested in multiword expressions and adpositions/case markers. He has inhabited UC Berkeley (BA in Computer Science and Linguistics), Carnegie Mellon University (Ph.D. in Language Technologies), and the University of Edinburgh (postdoc). Now a Hoya and leader of NERT, he continues to play with data and algorithms for linguistic meaning.

pdf bib abs
Processing MWEs: Neurocognitive Bases of Verbal MWEs and Lexical Cohesiveness within MWEs
Shohini Bhattasali | Murielle Fabre | John Hale

Multiword expressions have posed a challenge in the past for computational linguistics since they comprise a heterogeneous family of word clusters and are difficult to detect in natural language data. In this paper, we present a fMRI study based on language comprehension to provide neuroimaging evidence for processing MWEs. We investigate whether different MWEs have distinct neural bases, e.g. if verbal MWEs involve separate brain areas from non-verbal MWEs and if MWEs with varying levels of cohesiveness activate dissociable brain regions. Our study contributes neuroimaging evidence illustrating that different MWEs elicit spatially distinct patterns of activation. We also adapt an association measure, usually used to detect MWEs, as a cognitively plausible metric for language processing.

pdf bib abs
The Interplay of Form and Meaning in Complex Medical Terms: Evidence from a Clinical Corpus
Leonie Grön | Ann Bertels | Kris Heylen

We conduct a corpus study to investigate the structure of multi-word expressions (MWEs) in the clinical domain. Based on an existing medical taxonomy, we develop an annotation scheme and label a sample of MWEs from a Dutch corpus with semantic and grammatical features. The analysis of the annotated data shows that the formal structure of clinical MWEs correlates with their conceptual properties. The insights gained from this study could inform the design of Natural Language Processing (NLP) systems for clinical writing, but also for other specialized genres.

pdf bib abs
Discourse and Lexicons: Lexemes, MWEs, Grammatical Constructions and Compositional Word Combinations to Signal Discourse Relations
Laurence Danlos

Lexicons generally record a list of lexemes or non-compositional multiword expressions. We propose to build lexicons for compositional word combinations, namely “secondary discourse connectives”. Secondary discourse connectives play the same function as “primary discourse connectives” but the latter are either lexemes or non-compositional multiword expressions. The paper defines primary and secondary connectives, and explains why it is possible to build a lexicon for the compositional ones and how it could be organized. It also puts forward the utility of such a lexicon in discourse annotation and parsing. Finally, it opens the discussion on the constructions that signal a discourse relation between two spans of text.

pdf bib abs
From Chinese Word Segmentation to Extraction of Constructions: Two Sides of the Same Algorithmic Coin
Jean-Pierre Colson

This paper presents the results of two experiments carried out within the framework of computational construction grammar. Starting from the constructionist point of view that there are just constructions in language, including lexical ones, we tested the validity of a clustering algorithm that was primarily designed for MWE extraction, the cpr-score (Colson, 2017), on Chinese word segmentation. Our results indicate a striking recall rate of 75 percent without any special adaptation to Chinese or to the lexicon, which confirms that there is some similarity between extracting MWEs and CWS. Our second experiment also suggests that the same methodology might be used for extracting more schematic or abstract constructions, thereby providing evidence for the statistical foundation of construction grammar.

pdf bib abs
Fixed Similes: Measuring aspects of the relation between MWE idiomatic semantics and syntactic flexibility
Stella Markantonatou | Panagiotis Kouris | Yanis Maistros

We shed light on aspects of the relation between the semantics and the syntactic flexibility of multiword expressions by investigating fixed adjective similes (FS), a predicative multiword expression class not studied in this respect before. We find that only a subset of the syntactic structures observed in the data are related with idiomaticity. We identify and measure two aspects of idiomaticity, one of which seems to allow for predictions about FS syntactic flexibility. Our research draws on a resource developed with the semantic and detailed syntactic annotation of web-retrieved Modern Greek material, indicating frequency of use of the individual similes.

pdf bib abs
Fine-Grained Termhood Prediction for German Compound Terms Using Neural Networks
Anna Hätty | Sabine Schulte im Walde

Automatic term identification and investigating the understandability of terms in a specialized domain are often treated as two separate lines of research. We propose a combined approach for this matter, by defining fine-grained classes of termhood and framing a classification task. The classes reflect tiers of a term’s association to a domain. The new setup is applied to German closed compounds as term candidates in the domain of cooking. For the prediction of the classes, we compare several neural network architectures and also take salient information about the compounds’ components into account. We show that applying a similar class distinction to the compounds’ components and propagating this information within the network improves the compound class prediction results.

pdf bib abs
Towards a Computational Lexicon for Moroccan Darija: Words, Idioms, and Constructions
Jamal Laoudi | Claire Bonial | Lucia Donatelli | Stephen Tratz | Clare Voss

In this paper, we explore the challenges of building a computational lexicon for Moroccan Darija (MD), an Arabic dialect spoken by over 32 million people worldwide but which only recently has begun appearing frequently in written form in social media. We raise the question of what belongs in such a lexicon and start by describing our work building traditional word-level lexicon entries with their English translations. We then discuss challenges in translating idiomatic MD text that led to creating multi-word expression lexicon entries whose meanings could not be fully derived from the individual words. Finally, we provide a preliminary exploration of constructions to be considered for inclusion in an MD constructicon by translating examples of English constructions and examining their MD counterparts.

This paper presents a Basque corpus where Verbal Multiword Expressions (VMWEs) were annotated following universal guidelines. Information on the annotation is given, and some ideas for discussion upon the guidelines are also proposed. The corpus is useful not only for NLP-related research, but also to draw conclusions on Basque phraseology in comparison with other languages.

pdf bib abs
Annotation of Tense and Aspect Semantics for Sentential AMR
Lucia Donatelli | Michael Regan | William Croft | Nathan Schneider

Although English grammar encodes a number of semantic contrasts with tense and aspect marking, these semantics are currently ignored by Abstract Meaning Representation (AMR) annotations. This paper extends sentence-level AMR to include a coarse-grained treatment of tense and aspect semantics. The proposed framework augments the representation of finite predications to include a four-way temporal distinction (event time before, up to, at, or after speech time) and several aspectual distinctions (including static vs. dynamic, habitual vs. episodic, and telic vs. atelic). This will enable AMR to be used for NLP tasks and applications that require sophisticated reasoning about time and event structure.

pdf bib abs
A Syntax-Based Scheme for the Annotation and Segmentation of German Spoken Language Interactions
Swantje Westpfahl | Jan Gorisch

Unlike corpora of written language where segmentation can mainly be derived from orthographic punctuation marks, the basis for segmenting spoken language corpora is not predetermined by the primary data, but rather has to be established by the corpus compilers. This impedes consistent querying and visualization of such data. Several ways of segmenting have been proposed, some of which are based on syntax. In this study, we developed and evaluated annotation and segmentation guidelines in reference to the topological field model for German. We can show that these guidelines are used consistently across annotators. We also investigated the influence of various interactional settings with a rather simple measure, the word-count per segment and unit-type. We observed that the word count and the distribution of each unit type differ in varying interactional settings and that our developed segmentation and annotation guidelines are used consistently across annotators. In conclusion, our syntax-based segmentations reflect interactional properties that are intrinsic to the social interactions that participants are involved in. This can be used for further analysis of social interaction and opens the possibility for automatic segmentation of transcripts.

pdf bib abs
An Annotated Corpus of Picture Stories Retold by Language Learners
Christine Köhn | Arne Köhn

Corpora with language learner writing usually consist of essays, which are difficult to annotate reliably and to process automatically due to the high degree of freedom and the nature of learner language. We develop a task which mildly constrains learner utterances to facilitate consistent annotation and reliable automatic processing but at the same time does not prime learners with textual information. In this task, learners retell a comic strip. We present the resulting task-based corpus of stories written by learners of German. We designed the corpus to be able to serve multiple purposes: The corpus was manually annotated, including target hypotheses and syntactic structures. We achieve a very high inter-annotator agreement: κ = 0.765 for the annotation of minimal target hypotheses and κ = 0.507 for the extended target hypotheses. We attribute this to the design of our task and the annotation guidelines, which are based on those for the Falko corpus (Reznicek et al., 2012).

When a hazard such as a hurricane threatens, people are forced to make a wide variety of decisions, and the information they receive and produce can influence their own and others’ actions. As social media grows more popular, an increasing number of people are using social media platforms to obtain and share information about approaching threats and discuss their interpretations of the threat and their protective decisions. This work aims to improve understanding of natural disasters through social media and provide an annotation scheme to identify themes in user’s social media behavior and facilitate efforts in supervised machine learning. To that end, this work has three contributions: (1) the creation of an annotation scheme to consistently identify hazard-related themes in Twitter, (2) an overview of agreement rates and difficulties in identifying annotation categories, and (3) a public release of both the dataset and guidelines developed from this scheme.

pdf bib abs
A Treebank for the Healthcare Domain
Nganthoibi Oinam | Diwakar Mishra | Pinal Patel | Narayan Choudhary | Hitesh Desai

This paper presents a treebank for the healthcare domain developed at ezDI. The treebank is created from a wide array of clinical health record documents across hospitals. The data has been de-identified and annotated for constituent syntactic structure. The treebank contains a total of 52053 sentences that have been sampled for subdomains as well as linguistic variations. The paper outlines the sampling process followed to ensure a better domain representation in the corpus, the annotation process and challenges, and corpus statistics. The Penn Treebank tagset and guidelines were largely followed, but there were many syntactic contexts that warranted adaptation of the guidelines. The treebank created was used to re-train the Berkeley parser and the Stanford parser. These parsers were also trained with the GENIA treebank for comparative quality assessment. Our treebank yielded great-er accuracy on both parsers. Berkeley parser performed better on our treebank with an average F1 measure of 91 across 5-folds. This was a significant jump from the out-of-the-box F1 score of 70 on Berkeley parser’s default grammar.

pdf bib abs
The RST Spanish-Chinese Treebank
Shuyuan Cao | Iria da Cunha | Mikel Iruskieta

Discourse analysis is necessary for different tasks of Natural Language Processing (NLP). As two of the most spoken languages in the world, discourse analysis between Spanish and Chinese is important for NLP research. This paper aims to present the first open Spanish-Chinese parallel corpus annotated with discourse information, whose theoretical framework is based on the Rhetorical Structure Theory (RST). We have evaluated and harmonized each annotation part to obtain a high annotated-quality corpus. The corpus is already available to the public.

pdf bib abs
All Roads Lead to UD: Converting Stanford and Penn Parses to English Universal Dependencies with Multilayer Annotations
Siyao Peng | Amir Zeldes

We describe and evaluate different approaches to the conversion of gold standard corpus data from Stanford Typed Dependencies (SD) and Penn-style constituent trees to the latest English Universal Dependencies representation (UD 2.2). Our results indicate that pure SD to UD conversion is highly accurate across multiple genres, resulting in around 1.5% errors, but can be improved further to fewer than 0.5% errors given access to annotations beyond the pure syntax tree, such as entity types and coreference resolution, which are necessary for correct generation of several UD relations. We show that constituent-based conversion using CoreNLP (with automatic NER) performs substantially worse in all genres, including when using gold constituent trees, primarily due to underspecification of phrasal grammatical functions.

pdf bib abs
The Other Side of the Coin: Unsupervised Disambiguation of Potentially Idiomatic Expressions by Contrasting Senses
Hessel Haagsma | Malvina Nissim | Johan Bos

Disambiguation of potentially idiomatic expressions involves determining the sense of a potentially idiomatic expression in a given context, e.g. determining that make hay in ‘Investment banks made hay while takeovers shone.’ is used in a figurative sense. This enables automatic interpretation of idiomatic expressions, which is important for applications like machine translation and sentiment analysis. In this work, we present an unsupervised approach for English that makes use of literalisations of idiom senses to improve disambiguation, which is based on the lexical cohesion graph-based method by Sporleder and Li (2009). Experimental results show that, while literalisation carries novel information, its performance falls short of that of state-of-the-art unsupervised methods.

pdf bib abs
Do Character-Level Neural Network Language Models Capture Knowledge of Multiword Expression Compositionality?
Ali Hakimi Parizi | Paul Cook

In this paper, we propose the first model for multiword expression (MWE) compositionality prediction based on character-level neural network language models. Experimental results on two kinds of MWEs (noun compounds and verb-particle constructions) and two languages (English and German) suggest that character-level neural network language models capture knowledge of multiword expression compositionality, in particular for English noun compounds and the particle component of English verb-particle constructions. In contrast to many other approaches to MWE compositionality prediction, this character-level approach does not require token-level identification of MWEs in a training corpus, and can potentially predict the compositionality of out-of-vocabulary MWEs.

This paper describes the construction and annotation of a corpus of verbal MWEs for English, as part of the PARSEME Shared Task 1.1 on automatic identification of verbal MWEs. The criteria for corpus selection, the categories of MWEs used, and the training process are discussed, along with the particular issues that led to revisions in edition 1.1 of the annotation guidelines. Finally, an overview of the characteristics of the final annotated corpus is presented, as well as some discussion on inter-annotator agreement.

pdf bib abs
Cooperating Tools for MWE Lexicon Management and Corpus Annotation
Yuji Matsumoto | Akihiko Kato | Hiroyuki Shindo | Toshio Morita

We present tools for lexicon and corpus management that offer cooperating functionality in corpus annotation. The former, named Cradle, stores a set of words and expressions where multi-word expressions are defined with their own part-of-speech information and internal syntactic structures. The latter, named ChaKi, manages text corpora with part-of-speech (POS) and syntactic dependency structure annotations. Those two tools cooperate so that the words and multi-word expressions stored in Cradle are directly referred to by ChaKi in conducting corpus annotation, and the words and expressions annotated in ChaKi can be output as a list of lexical entities that are to be stored in Cradle.

pdf bib abs
“Fingers in the Nose”: Evaluating Speakers’ Identification of Multi-Word Expressions Using a Slightly Gamified Crowdsourcing Platform
Karën Fort | Bruno Guillaume | Matthieu Constant | Nicolas Lefèbvre | Yann-Alan Pilatte

This article presents the results we obtained in crowdsourcing French speakers’ intuition concerning multi-work expressions (MWEs). We developed a slightly gamified crowdsourcing platform, part of which is designed to test users’ ability to identify MWEs with no prior training. The participants perform relatively well at the task, with a recall reaching 65% for MWEs that do not behave as function words.

pdf bib abs
Improving Domain Independent Question Parsing with Synthetic Treebanks
Halim-Antoine Boukaram | Nizar Habash | Micheline Ziadee | Majd Sakr

Automatic syntactic parsing for question constructions is a challenging task due to the paucity of training examples in most treebanks. The near absence of question constructions is due to the dominance of the news domain in treebanking efforts. In this paper, we compare two synthetic low-cost question treebank creation methods with a conventional manual high-cost annotation method in the context of three domains (news questions, political talk shows, and chatbots) for Modern Standard Arabic, a language with relatively low resources and rich morphology. Our results show that synthetic methods can be effective at significantly reducing parsing errors for a target domain without having to invest large resources on manual annotation; and the combination of manual and synthetic methods is our best domain-independent performer.

This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.

pdf bib abs
CRF-Seq and CRF-DepTree at PARSEME Shared Task 2018: Detecting Verbal MWEs using Sequential and Dependency-Based Approaches
Erwan Moreau | Ashjan Alsulaimani | Alfredo Maldonado | Carl Vogel

This paper describes two systems for detecting Verbal Multiword Expressions (VMWEs) which both competed in the closed track at the PARSEME VMWE Shared Task 2018. CRF-DepTree-categs implements an approach based on the dependency tree, intended to exploit the syntactic and semantic relations between tokens; CRF-Seq-nocategs implements a robust sequential method which requires only lemmas and morphosyntactic tags. Both systems ranked in the top half of the ranking, the latter ranking second for token-based evaluation. The code for both systems is published under the GNU General Public License version 3.0 and is available at http://github.com/erwanm/adapt-vmwe18.

pdf bib abs
Deep-BGT at PARSEME Shared Task 2018: Bidirectional LSTM-CRF Model for Verbal Multiword Expression Identification
Gözde Berk | Berna Erden | Tunga Güngör

This paper describes the Deep-BGT system that participated to the PARSEME shared task 2018 on automatic identification of verbal multiword expressions (VMWEs). Our system is language-independent and uses the bidirectional Long Short-Term Memory model with a Conditional Random Field layer on top (bidirectional LSTM-CRF). To the best of our knowledge, this paper is the first one that employs the bidirectional LSTM-CRF model for VMWE identification. Furthermore, the gappy 1-level tagging scheme is used for discontiguity and overlaps. Our system was evaluated on 10 languages in the open track and it was ranked the second in terms of the general ranking metric.

pdf bib abs
GBD-NER at PARSEME Shared Task 2018: Multi-Word Expression Detection Using Bidirectional Long-Short-Term Memory Networks and Graph-Based Decoding
Tiberiu Boros | Ruxandra Burtica

This paper addresses the issue of multi-word expression (MWE) detection by employing a new decoding strategy inspired after graph-based parsing. We show that this architecture achieves state-of-the-art results with minimum feature-engineering, just by relying on lexicalized and morphological attributes. We validate our approach in a multilingual setting, using standard MWE corpora supplied in the PARSEME Shared Task.

pdf bib abs
Mumpitz at PARSEME Shared Task 2018: A Bidirectional LSTM for the Identification of Verbal Multiword Expressions
Rafael Ehren | Timm Lichte | Younes Samih

In this paper, we describe Mumpitz, the system we submitted to the PARSEME Shared task on automatic identification of verbal multiword expressions (VMWEs). Mumpitz consists of a Bidirectional Recurrent Neural Network (BRNN) with Long Short-Term Memory (LSTM) units and a heuristic that leverages the dependency information provided in the PARSEME corpus data to differentiate VMWEs in a sentence. We submitted results for seven languages in the closed track of the task and for one language in the open track. For the open track we used the same system, but with pretrained instead of randomly initialized word embeddings to improve the system performance.

pdf bib abs
TRAPACC and TRAPACCS at PARSEME Shared Task 2018: Neural Transition Tagging of Verbal Multiword Expressions
Regina Stodden | Behrang QasemiZadeh | Laura Kallmeyer

We describe the TRAPACC system and its variant TRAPACCS that participated in the closed track of the PARSEME Shared Task 2018 on labeling verbal multiword expressions (VMWEs). TRAPACC is a modified arc-standard transition system based on Constant and Nivre’s (2016) model of joint syntactic and lexical analysis in which the oracle is approximated using a classifier. For TRAPACC, the classifier consists of a data-independent dimension reduction and a convolutional neural network (CNN) for learning and labelling transitions. TRAPACCS extends TRAPACC by replacing the softmax layer of the CNN with a support vector machine (SVM). We report the results obtained for 19 languages, for 8 of which our system yields the best results compared to other participating systems in the closed-track of the shared task.

pdf bib abs
TRAVERSAL at PARSEME Shared Task 2018: Identification of Verbal Multiword Expressions Using a Discriminative Tree-Structured Model
Jakub Waszczuk

This paper describes a system submitted to the closed track of the PARSEME shared task (edition 1.1) on automatic identification of verbal multiword expressions (VMWEs). The system represents VMWE identification as a labeling task where one of two labels (MWE or not-MWE) must be predicted for each node in the dependency tree based on local context, including adjacent nodes and their labels. The system relies on multiclass logistic regression to determine the globally optimal labeling of a tree. The system ranked 1st in the general cross-lingual ranking of the closed track systems, according to both official evaluation measures: MWE-based F1 and token-based F1.

pdf bib abs
VarIDE at PARSEME Shared Task 2018: Are Variants Really as Alike as Two Peas in a Pod?
Caroline Pasquer | Carlos Ramisch | Agata Savary | Jean-Yves Antoine

We describe the VarIDE system (standing for Variant IDEntification) which participated in the edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs). Our system focuses on the task of VMWE variant identification by using morphosyntactic information in the training data to predict if candidates extracted from the test corpus could be idiomatic, thanks to a naive Bayes classifier. We report results for 19 languages.

pdf bib abs
Veyn at PARSEME Shared Task 2018: Recurrent Neural Networks for VMWE Identification
Nicolas Zampieri | Manon Scholivet | Carlos Ramisch | Benoit Favre

This paper describes the Veyn system, submitted to the closed track of the PARSEME Shared Task 2018 on automatic identification of verbal multiword expressions (VMWEs). Veyn is based on a sequence tagger using recurrent neural networks. We represent VMWEs using a variant of the begin-inside-outside encoding scheme combined with the VMWE category tag. In addition to the system description, we present development experiments to determine the best tagging scheme. Veyn is freely available, covers 19 languages, and was ranked ninth (MWE-based) and eight (Token-based) among 13 submissions, considering macro-averaged F1 across languages.