Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). Marie-Catherine de Marneffe, Teresa Lynn, Sebastian Schuster. November 2018

Brussels, Belgium

Association for Computational Linguistics. http://www.aclweb.org/anthology/W18-60

book UDW2018:2018

Assessing the Impact of Incremental Error Detection and Correction. A Case Study on the Italian Universal Dependency Treebank. Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni, Maria Simi, Giulia Venturi. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 1–7. http://www.aclweb.org/anthology/W18-6001

Detection and correction of errors and inconsistencies in "gold treebanks" are becoming more and more central topics of corpus annotation. The paper illustrates a new incremental method for enhancing treebanks, with particular emphasis on the extension of error patterns across different textual genres and registers. Impact and role of corrections have been assessed in a dependency parsing experiment carried out with four different parsers, whose results are promising. For both evaluation datasets, the performance of parsers increases, in terms of the standard LAS and UAS measures and of a more focused measure taking into account only relations involved in error patterns, as well as at the level of individual dependencies.

inproceedings alzetta-EtAl:2018:UDW2018

Using Universal Dependencies in cross-linguistic complexity research. Aleksandrs Berdicevskis, Çağrı Çöltekin, Katharina Ehret, Kilu von Prince, Daniel Ross, Bill Thompson, Chunxiao Yan, Vera Demberg, Gary Lupyan, Taraka Rama, Christian Bentz. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 8–17. http://www.aclweb.org/anthology/W18-6002

We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks. We propose a method of estimating robustness of the complexity values obtained using a given measure and a given treebank. The results indicate that measures of syntactic complexity might be on average less robust than those of morphological complexity. We also estimate the validity of complexity measures by comparing the results for very similar languages and checking for unexpected differences. We show that some of those differences that arise can be diminished by using parallel treebanks and, more importantly from the practical point of view, by harmonizing the language-specific solutions in the UD annotation.

inproceedings berdicevskis-EtAl:2018:UDW2018

Expletives in Universal Dependency Treebanks. Gosse Bouma, Jan Hajic, Dag Haug, Joakim Nivre, Per Erik Solberg, Lilja Øvrelid. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 18–26. http://www.aclweb.org/anthology/W18-6003

Although treebanks annotated according to the guidelines of Universal Dependencies (UD) now exist for many languages, the goal of annotating the same phenomena in a cross-linguistically consistent fashion is not always met.

inproceedings bouma-EtAl:2018:UDW2018

Challenges in Converting the Index Thomisticus Treebank into Universal Dependencies. Flavio Massimiliano Cecchini, Marco Passarotti, Paola Marongiu, Daniel Zeman. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 27–36. http://www.aclweb.org/anthology/W18-6004

This paper describes the changes applied to the original process used to convert the Index Thomisticus Treebank, a corpus including texts in Medieval Latin by Thomas Aquinas, into the annotation style of Universal Dependencies. The changes are made both to harmonise the Universal Dependencies version of the Index Thomisticus Treebank with the two other available Latin treebanks and to fix errors and inconsistencies resulting from the original process. The paper details the treatment of different issues in PoS tagging, lemmatisation and assignment of dependency relations. Finally, it assesses the quality of the new conversion process by providing an evaluation against a gold standard.

inproceedings cecchini-EtAl:2018:UDW2018

Er ... well, it matters, right? On the role of data representations in spoken language dependency parsing. Kaja Dobrovoljc, Matej Martinc. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 37–46. http://www.aclweb.org/anthology/W18-6005

Despite the significant improvement of data-driven dependency parsing systems in recent years, they still achieve considerably lower performance in parsing spoken language data than written data. Taking the Spoken Slovenian Treebank, the first spoken data treebank using the UD annotation scheme, as an example, we investigate which speech-specific phenomena undermine parsing performance, through a series of training data and treebank modification experiments using two distinct state-of-the-art parsing systems. Our results show that segmentation is the most prominent cause of low parsing performance, both in parsing raw and pre-segmented transcriptions. In addition to shorter utterances, both parsers perform better on normalized transcriptions including basic markers of prosody and excluding disfluencies, discourse markers and fillers. On the other hand, the effects of written training data addition and speech-specific dependency representations largely depend on the parsing system selected.

inproceedings dobrovoljc-martinc:2018:UDW2018

Mind the Gap: Data Enrichment in Dependency Parsing of Elliptical Constructions. Kira Droganova, Filip Ginter, Jenna Kanerva, Daniel Zeman. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 47–54. http://www.aclweb.org/anthology/W18-6006

In this paper, we focus on parsing rare and non-trivial constructions, in particular ellipsis. We report on several experiments in enrichment of training data for this specific construction, evaluated on five languages: Czech, English, Finnish, Russian and Slovak.

inproceedings droganova-EtAl:2018:UDW2018

Integration complexity and the order of cosisters. William Dyer. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 55–65. http://www.aclweb.org/anthology/W18-6007

The cost of integrating dependent constituents to their heads is thought to involve the distance between dependent and head and the complexity of the integration (Gibson, 1998). The former has been convincingly addressed by Dependency Distance Minimization (DDM) (cf. Liu et al., 2017). The current study addresses the latter by proposing a novel theory of integration complexity derived from the entropy of the probability distribution of a dependent’s heads. An analysis of Universal Dependency corpora provides empirical evidence regarding the preferred order of isomorphic cosisters (sister constituents of the same syntactic form on the same side of their head), such as the adjectives in "pretty blue fish." Integration complexity, alongside DDM, allows for a general theory of constituent order based on integration cost.

inproceedings dyer:2018:UDW2018

SUD or Surface-Syntactic Universal Dependencies: An annotation scheme near-isomorphic to UD. Kim Gerdes, Bruno Guillaume, Sylvain Kahane, Guy Perrier. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 66–74. http://www.aclweb.org/anthology/W18-6008

This article proposes a surface-syntactic annotation scheme called SUD that is near-isomorphic to the Universal Dependencies (UD) annotation scheme while following distributional criteria for defining the dependency tree structure and the naming of the syntactic functions. Rule-based graph transformation grammars allow for a bi-directional transformation of UD into SUD. The back-and-forth transformation can also be seen as a powerful error-mining tool to assure the intra-language and inter-language coherence of the UD treebanks.

inproceedings gerdes-EtAl:2018:UDW2018

Coordinate Structures in Universal Dependencies for Head-final Languages. Hiroshi Kanayama, Na-Rae Han, Masayuki Asahara, Jena D. Hwang, Yusuke Miyao, Jinho D. Choi, Yuji Matsumoto. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 75–84. http://www.aclweb.org/anthology/W18-6009

This paper discusses the representation of coordinate structures in the Universal Dependencies framework for two head-final languages, Japanese and Korean. UD applies a strict principle that makes the head of coordination the left-most conjunct. However, the guideline may produce syntactic trees which are difficult to accept in head-final languages. This paper describes the status in the current corpora and proposes alternative designs suitable for these languages.

inproceedings kanayama-EtAl:2018:UDW2018

Investigating NP-Chunking with Universal Dependencies for English. Ophélie Lacroix. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 85–90. http://www.aclweb.org/anthology/W18-6010

Chunking is a pre-processing task generally dedicated to improving constituency parsing. In this paper, we want to show that universal dependency (UD) parsing can also leverage the information provided by the task of chunking even though annotated chunks are not provided with universal dependency trees. In particular, we introduce the possibility of deducing noun-phrase (NP) chunks from universal dependencies, focusing on English as a first example. We then demonstrate how the task of NP-chunking can benefit PoS-tagging in a multi-task learning setting (comparing two different strategies) and how it can be used as a feature for dependency parsing in order to learn enriched models.

inproceedings lacroix:2018:UDW2018

Marrying Universal Dependencies and Universal Morphology. Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, David Yarowsky. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 91–101. http://www.aclweb.org/anthology/W18-6011

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages: UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatibility of tags, each project’s annotations could be used to validate the other’s. Additionally, the availability of both type- and token-level resources would be a boon to tasks such as parsing and homograph disambiguation. To ease this interoperability, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema. We validate our approach by lookup in the UniMorph corpora and find a macro-average of 64.13% recall. We also note incompatibilities due to paucity of data on either side. Finally, we present a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects.

inproceedings mccarthy-EtAl:2018:UDW2018

Enhancing Universal Dependency Treebanks: A Case Study. Joakim Nivre, Paola Marongiu, Filip Ginter, Jenna Kanerva, Simonetta Montemagni, Sebastian Schuster, Maria Simi. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 102–107. http://www.aclweb.org/anthology/W18-6012

We evaluate cross-lingual techniques for adding enhanced dependencies to existing treebanks in Universal Dependencies. We apply a rule-based system for English and a data-driven system trained on Finnish to Swedish and Italian. We find that both systems are accurate enough to bootstrap enhanced dependencies in existing UD treebanks. For Italian, results are even on par with those of a language-specific system.

inproceedings nivre-EtAl:2018:UDW2018

Enhancing Universal Dependencies for Korean. Youngbin Noh, Jiyoon Han, Tae Hwan Oh, Hansaem Kim. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 108–116. http://www.aclweb.org/anthology/W18-6013

In this paper, for the purpose of enhancing Universal Dependencies for the Korean language, we propose a modified method for mapping the Korean Part-of-Speech (POS) tagset to the Universal Part-of-Speech (UPOS) tagset. Previous studies suggest that UPOS reflects several issues that influence dependency annotation by using the POS of Korean predicates, particularly the distinctiveness in using verb, adjective, and copula.

inproceedings noh-EtAl:2018:UDW2018

UD-Japanese BCCWJ: Universal Dependencies Annotation for the Balanced Corpus of Contemporary Written Japanese. Mai Omura, Masayuki Asahara. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 117–125. http://www.aclweb.org/anthology/W18-6014

In this paper, we describe a corpus UD Japanese-BCCWJ that was created by converting the Balanced Corpus of Contemporary Written Japanese (BCCWJ), a Japanese language corpus, to adhere to the UD annotation schema. The BCCWJ already assigns dependency information at the level of the bunsetsu (a Japanese syntactic unit comparable to the phrase). We developed a program to convert the BCCWJ to UD based on this dependency structure, and this corpus is the result of completely automatic conversion using the program. UD Japanese-BCCWJ is the largest-scale UD Japanese corpus and the second-largest of all UD corpora, including 1,980 documents, 57,109 sentences, and 1,273k words across six distinct domains.

inproceedings omura-asahara:2018:UDW2018

The First Komi-Zyrian Universal Dependencies Treebanks. Niko Partanen, Rogier Blokland, KyungTae Lim, Thierry Poibeau, Michael Rießler. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 126–132. http://www.aclweb.org/anthology/W18-6015

Two Komi-Zyrian treebanks were included in the Universal Dependencies 2.2 release. This article contextualizes the treebanks, discusses the process through which they were created, and outlines the future plans and timeline for the next improvements. Special attention is paid to the possibilities of using UD in the documentation and description of endangered languages.

inproceedings partanen-EtAl:2018:UDW2018

The Hebrew Universal Dependency Treebank: Past Present and Future. Shoval Sade, Amit Seker, Reut Tsarfaty. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 133–143. http://www.aclweb.org/anthology/W18-6016

The Hebrew treebank (HTB), consisting of 6221 morpho-syntactically annotated newspaper sentences, has been the only resource for training and validating Hebrew statistical parsers for almost two decades now. During these decades, the HTB has gone through a trajectory of automatic and semi-automatic conversions, until arriving at its current UDv2 form. In this work we set out to manually validate the UDv2 version and, accordingly, we apply scheme changes to bring the UD HTB into the same theoretical ground as the rest of UD. Our experimental results show that improving the linguistic coherence and internal consistency of the UD HTB has indeed led to improved syntactic parsing performance. At the same time, there is more to be done at the points of intersection with other linguistic processing layers, in particular, at the interface of UD with external morphological and lexical resources.

inproceedings sade-seker-tsarfaty:2018:UDW2018

Multi-source synthetic treebank creation for improved cross-lingual dependency parsing. Francis Tyers, Mariya Sheyanova, Aleksandra Martynova, Pavel Stepachev, Konstantin Vinogorodskiy. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 144–150. http://www.aclweb.org/anthology/W18-6017

This paper describes a method of creating synthetic treebanks for

inproceedings tyers-EtAl:2018:UDW2018

Toward Universal Dependencies for Shipibo-Konibo. Alonso Vásquez, Renzo Ego Aguirre, Candy Angulo, John Miller, Claudia Villanueva, Željko Agić, Roberto Zariquiey, Arturo Oncevay. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 151–161. http://www.aclweb.org/anthology/W18-6018

We present an initial version of the Universal Dependencies (UD) treebank for Shipibo-Konibo, the first South American, Amazonian, Panoan and Peruvian language with a resource built under UD. We describe the linguistic aspects of how the tagset was defined and the treebank was annotated; in addition we present our specific treatment of linguistic units called clitics. Although the treebank is still under development, it allowed us to perform a typological comparison against Spanish, the predominant language in Peru, and dependency syntax parsing experiments in both monolingual and cross-lingual approaches.

inproceedings vsquez-EtAl:2018:UDW2018

Transition-based Parsing with Lighter Feed-Forward Networks. David Vilares, Carlos Gómez-Rodríguez. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 162–172. http://www.aclweb.org/anthology/W18-6019

We explore whether it is possible to build lighter parsers that are statistically equivalent to their corresponding standard version, for a wide set of languages showing different structures and morphologies. As a testbed, we use the Universal Dependencies and transition-based dependency parsers trained on feed-forward networks. For these, most existing research assumes de facto standard embedded features and relies on pre-computation tricks to obtain speed-ups. We explore how these features and their size can be reduced and whether this translates into speed-ups with a negligible impact on accuracy. The experiments show that grand-daughter features can be removed for the majority of treebanks without a significant (negative or positive) LAS difference. They also show how the size of the embeddings can be notably reduced.

inproceedings vilares-gmezrodrguez:2018:UDW2018

Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format. Alina Wróblewska. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 173–182. http://www.aclweb.org/anthology/W18-6020

The paper presents the largest Polish Dependency Bank in Universal Dependencies format, PDBUD, with 22K trees and 352K tokens. PDBUD builds on its previous version, i.e. the Polish UD treebank (PL-SZ), and contains all 8K PL-SZ trees. The PL-SZ trees are checked and possibly corrected in the current edition of PDBUD. A further 14K trees are automatically converted from a new version of Polish Dependency Bank. The PDBUD trees are expanded with the enhanced edges encoding the shared dependents and the shared governors of the coordinated conjuncts and with the semantic roles of some dependents. The conducted evaluation experiments show that PDBUD is large enough for training a high-quality graph-based dependency parser for Polish.

inproceedings wrblewska:2018:UDW2018

Approximate Dynamic Oracle for Dependency Parsing with Reinforcement Learning. Xiang Yu, Ngoc Thang Vu, Jonas Kuhn. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 183–191. http://www.aclweb.org/anthology/W18-6021

We present a general approach with reinforcement learning (RL) to approximate dynamic oracles for transition systems where exact dynamic oracles are difficult to derive. We treat oracle parsing as a reinforcement learning problem, design the reward function inspired by the classical dynamic oracle, and use Deep Q-Learning (DQN) techniques to train the oracle with gold trees as features. The combination of a priori knowledge and data-driven methods enables an efficient dynamic oracle, which improves the parser performance over static oracles in several transition systems.

inproceedings yu-vu-kuhn:2018:UDW2018

The Coptic Universal Dependency Treebank. Amir Zeldes, Mitchell Abrams. Proceedings of the Second Workshop on Universal Dependencies (UDW 2018). November 2018

Brussels, Belgium

Association for Computational Linguistics, pages 192–201. http://www.aclweb.org/anthology/W18-6022

This paper presents the Coptic Universal Dependency Treebank, the first dependency treebank within the Egyptian subfamily of the Afro-Asiatic languages. We discuss the composition of the corpus, challenges in adapting the UD annotation scheme to existing conventions for annotating Coptic, and evaluate inter-annotator agreement on UD annotation for the language. Some specific constructions are taken as a starting point for discussing several more general UD annotation guidelines, in particular for appositions, ambiguous passivization, incorporation and object-doubling.

inproceedings zeldes-abrams:2018:UDW2018