Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)
We present the first Universal Dependencies treebank for Hittite. This paper expands on earlier efforts at Hittite corpus creation (Molina and Molin, 2016; Molina, 2016) and discussions of annotation guidelines for Hittite within the UD framework (Inglese, 2015; Inglese et al., 2018). We build on the expertise of the above works to create a small corpus which we hope will serve as a stepping-stone to more expansive UD treebanking for Hittite.
Cross-lingual and multilingual methods have been widely suggested as options for dependency parsing of low-resource languages; however, these typically require the use of annotated data in related high-resource languages. In this paper, we evaluate the performance of these methods versus monolingual parsing of Tagalog, an Austronesian language which shares little typological similarity with any existing high-resource languages. We show that a monolingual model developed on minimal target language data consistently outperforms all cross-lingual and multilingual models when no closely-related sources exist for a low-resource language.
A Universal Dependencies Conversion Pipeline for a Penn-format Constituency Treebank
Þórunn Arnardóttir | Hinrik Hafsteinsson | Einar Freyr Sigurðsson | Kristín Bjarnadóttir | Anton Karl Ingason | Hildur Jónsdóttir | Steinþór Steingrímsson
The topic of this paper is a rule-based pipeline for converting constituency treebanks based on the Penn Treebank format to Universal Dependencies (UD). We describe an Icelandic constituency treebank, its annotation scheme and the UD scheme. The conversion is discussed, the methods used to deliver a fully automated UD corpus and complications involved. To show its applicability to corpora in different languages, we extend the pipeline and convert a Faroese constituency treebank to a UD corpus. The result is an open-source conversion tool, published under an Apache 2.0 license, applicable to a Penn-style treebank for conversion to a UD corpus, along with the two new UD corpora.
We use Universal Dependencies treebanks to test whether a well-known typological trade-off between word order freedom and richness of morphological marking of core arguments holds within individual languages. Using Russian and German treebank data, we show that the following phenomenon (sometimes dubbed word order freezing) does occur: those sentences where core arguments cannot be distinguished by morphological means (due to case syncretism or other kinds of ambiguity) have more rigid order of subject, verb and object than those where unambiguous morphological marking is present. In ambiguous clauses, word order is more often equal to the one which is default or dominant (most frequent) in the language. While Russian and German differ with respect to how exactly they mark core arguments, the effect of morphological ambiguity is significant in both languages. It is, however, small, suggesting that languages do adapt to the evolutionary pressure on communicative efficiency and avoidance of redundancy, but that the pressure is weak in this particular respect.
This article considers the annotation of subjects in UD treebanks. The identification of the subject poses a particular problem in Wolof, due to pronominal indices whose status as a pronoun or a pronominal affix is uncertain. In the UD treebank available for Wolof (Dione, 2019), these have been annotated depending on the construction either as true subjects, or as morphosyntactic features agreeing with the verb. The study of this corpus of 40 000 words allows us to show that the problem is indeed difficult to solve, especially since Wolof has a rich system of auxiliaries and several basic constructions with different properties. Before addressing the case of Wolof, we will present the simpler, but partly comparable, case of French, where subject clitics also tend to behave like affixes, and subjecthood can move from the preverbal to the detached position. We will also make a several annotation recommendations that would avoid overwriting information regarding subjecthood.
As in any field of inquiry that depends on experiments, the verifiability of experimental studies is important in computational linguistics. Despite increased attention to verification of empirical results, the practices in the field are unclear. Furthermore, we argue, certain traditions and practices that are seemingly useful for verification may in fact be counterproductive. We demonstrate this through a set of multi-lingual experiments on parsing Universal Dependencies treebanks. In particular, we show that emphasis on exact replication leads to practices (some of which are now well established) that hide the variation in experimental results, effectively hindering verifiability with a false sense of certainty. The purpose of the present paper is to highlight the magnitude of the issues resulting from these common practices with the hope of instigating further discussion. Once we, as a community, are convinced about the importance of the problems, the solutions are rather obvious, although not necessarily easy to implement.
This paper reports on a systematic approach for deriving Universal Dependencies from LFG structures. The conversion starts with a step-wise transformation of the c-structure, combining part-of-speech (POS) information and the embedding path to determine the true head of dependency structures. The paper discusses several issues faced by existing algorithms when applied on Wolof and presents the strategies used to account for these issues. An experimental evaluation indicated that our approach was able to generate the correct output in more than 90% of the cases, leading to a substantial improvement in conversion accuracy compared to the previous models.
The Universal Dependencies treebanks are a still-growing collection of treebanks for a wide range of languages, all annotated with a common inventory of dependency relations. Yet, the usages of the relations can be categorically different even for treebanks of the same language. We present a pilot study on identifying such inconsistencies in a language-independent way and conduct an experiment which illustrates that a proper handling of inconsistencies can improve parsing performance by several percentage points.
Byte-pair encodings is a method for splitting a word into sub-word tokens, a language model then assigns contextual representations separately to each of these tokens. In this paper, we evaluate four different methods of composing such sub-word representations into word representations. We evaluate the methods on morphological sequence classification, the task of predicting grammatical features of a word. Our experiments reveal that using an RNN to compute word representations is consistently more effective than the other methods tested across a sample of eight languages with different typology and varying numbers of byte-pair tokens per word.
We revisit the problem of extracting dependency structures from the derivation structures of Combinatory Categorial Grammar (CCG). Previous approaches are often restricted to a narrow subset of CCG or support only one flavor of dependency tree. Our approach is more general and easily configurable, so that multiple styles of dependency tree can be obtained. In an initial case study, we show promising results for converting English, German, Italian, and Dutch CCG derivations from the Parallel Meaning Bank into (unlabeled) UD-style dependency trees.
HDT-UD, the largest German UD treebank by a large margin, as well as the German-LIT treebank, currently do not analyze preposition-determiner contractions such as zum (= zu dem, “to the”) as multi-word tokens, which is inconsistent both with UD guidelines as well as other German UD corpora (GSD and PUD). In this paper, we show that harmonizing corpora with regard to this highly frequent phenomenon using a lookup-table based approach leads to a considerable increase in automatic parsing performance.
To investigate issues that arise in the process of developing a Universal Dependency (UD) treebank for Korean and Japanese, we begin by addressing the typological characteristics of Korean and Japanese. Both Korean and Japanese are agglutinative and head-final languages. And the principle of word segmentation for both languages is different from English, which makes it difficult to apply UD guidelines. Following the typological characteristics of the two languages and the issue of UD application, we review the application of UPOS and DEPREL schemes to the two languages. The annotation principles for AUX, ADJ, DET, ADP and PART are discussed for the UPOS scheme, and the annotation principles for case, aux, iobj, and obl are discussed for the DEPREL scheme.
We report on an application of universal dependencies for the study of diachronic shifts in syntactic usage patterns. Our focus is on the evolution of Scientific English in the Late Modern English period (ca. 1700-1900). Our data set is the Royal Society Corpus (RSC), comprising the full set of publications of the Royal Society of London between 1665 and 1996. Our starting assumption is that over time, Scientific English develops specific syntactic choice preferences that increase efficiency in (expert-to-expert) communication. The specific hypothesis we pursue in this paper is that changing syntactic choice preferences lead to greater dependency locality/dependency length minimization, which is associated with positive effects for the efficiency of human as well as computational linguistic processing. As a basis for our measurements, we parsed the RSC using Stanford CoreNLP. Overall, we observe a decrease in dependency length, with long dependency structures becoming less frequent and short dependency structures becoming more frequent over time, notably pertaining to the nominal phrase, thus marking an overall push towards greater communicative efficiency.
UDon2 is an open-source library for manipulating dependency trees represented in the CoNLL-U format. The library is compatible with the Universal Dependencies. UDon2 is aimed at developers of downstream Natural Language Processing applications that require manipulating dependency trees on the sentence level (in addition to other available tools geared towards working with treebanks).
This paper reports on the analysis and annotation of Multiword Expressions in the Irish Universal Dependency Treebank. We provide a linguistic discussion around decisions on how to appropri- ately label Irish MWEs using the compound, flat and fixed dependency relation labels within the framework of the Universal Dependencies annotation guidelines. We discuss some nuances of the Irish language that pose challenges for assigning these UD labels and provide this report in support of the Irish UD annotation guidelines. With this we hope to ensure consistency in annotation across the dataset and provide a basis for future MWE annotation for Irish.
The UD framework defines guidelines for a crosslingual syntactic analysis in the framework of dependency grammar, with the aim of providing a consistent treatment across languages that not only supports multilingual NLP applications but also facilitates typological studies. Until now, the UD framework has mostly focussed on bilexical grammatical relations. In the paper, we propose to add a constructional perspective and discuss several examples of spoken-language constructions that occur in multiple languages and challenge the current use of basic and enhanced UD relations. The examples include cases where the surface relations are deceptive, and syntactic amalgams that either involve unconnected subtrees or structures with multiply-headed dependents. We argue that a unified treatment of constructions across languages will increase the consistency of the UD annotations and thus the quality of the treebanks for linguistic analysis.
Manx Gaelic is one of the three Q-Celtic languages, along with Irish and Scottish Gaelic. We present a new dependency treebank for Manx consisting of 291 sentences and about 6000 tokens, annotated according to the Universal Dependency (UD) guidelines. To the best of our knowledge, this is the first annotated corpus of any kind for Manx. Our annotations generally follow the conventions established by the existing UD treebanks for Irish and Scottish Gaelic, although we highlight some areas where the grammar of Manx diverges, requiring new analyses. We use 10-fold cross validation to evaluate the accuracy of dependency parsers trained on the corpus, and compare these results with delexicalised models transferred from Irish and Scottish Gaelic.
In this paper we present a method for identifying and analyzing adnominal possessive constructions in 66 Universal Dependencies treebanks. We classify adpossessive constructions in terms of their morphological type (locus of marking) and present a workflow for detecting and analyzing them typologically. Based on a preliminary evaluation, the algorithm works fairly reliably in adpossessive constructions that are morphologically marked. However, it performs rather poorly in adpossessive constructions that are not marked morphologically, so-called zero-marked constructions, because of difficulties in identifying these constructions with the current annotation. We also discuss different types of variation in annotation in different treebanks for the same language and for treebanks of closely related languages. The research focuses on one well-circumscribed and universal construction in the hope of generating more interest in using UD for cross-linguistic comparison and for contributing towards developing yet more consistent annotation of constructions in the UD annotation scheme.
We present PALMYRA 2.0, a graphical dependency-tree visualization and editing software. PALMYRA 2.0 is designed to be highly configurable to any dependency parsing representation, and to enable the annotation of a multitude of linguistic features. It uses an intuitive interface that relies on drag-and-drop utilities as well as pop-up menus and keyboard shortcuts that can be easily specified.
In this paper, we introduce the first Universal Dependencies (UD) treebank for standard Albanian, consisting of 60 sentences collected from the Albanian Wikipedia, annotated with lemmas, universal part-of-speech tags, morphological features and syntactic dependencies. In addition to presenting the treebank itself, we discuss a selection of linguistic constructions in Albanian whose analysis in UD is not self-evident, including core arguments and the status of indirect objects, pronominal clitics, genitive constructions, prearticulated adjectives, and modal verbs.
This paper presents the first treebank for the Laz language, which is also the first Universal Dependencies Treebank for a South Caucasian language. This treebank aims to create a syntactically and morphologically annotated resource for further research. We also aim to document an endangered language in a systematic fashion within an inherently cross-linguistic framework: the Universal Dependencies Project (UD). As of now, our treebank consists of 576 sentences and 2,306 tokens annotated in light with the UD guidelines. We evaluated the treebank on the dependency parsing task using a pretrained multilingual parsing model, and the results are comparable with other low-resourced treebanks with no training set. We aim to expand our treebank in the near future to include 1,500 sentences. The bigger goal for our project is to create a set of treebanks for minority languages in Anatolia.
This paper describes an approach to annotating noun incorporation in Universal Dependencies. It motivates the need to annotate this particular morphosyntactic phenomenon and justifies it with respect to frequency of the construction. A case study is presented in which the proposed annotation scheme is applied to Chukchi, a language that exhibits noun incorporation. We compare argument encoding in Chukchi, English and Russian and find that while in English and Russian discourse elements are primarily tracked through noun phrases and pronouns, in Chukchi they are tracked through agreement marking and incorporation, with a lesser role for noun phrases.
We present our work of constructing the first treebank for the Xibe language following the Universal Dependencies (UD) annotation scheme. Xibe is a low-resourced and severely endangered Tungusic language spoken by the Xibe minority living in the Xinjiang Uygur Autonomous Region of China. We collected 810 sentences so far, including 544 sentences from a grammar book on written Xibe and 266 sentences from Cabcal News. We annotated those sentences manually from scratch. In this paper, we report the procedure of building this treebank and analyze several important annotation issues of our treebank. Finally, we propose our plans for future work.