International Workshop on Treebanks and Linguistic Theories (2024)


Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)
Daniel Dakota | Sarah Jablotschkin | Sandra Kübler | Heike Zinsmeister

Developing the Egyptian-UJaen Treebank
Roberto Antonio Díaz Hernández | Marco Carlo Passarotti

This paper presents preliminary results of the development of the Egyptian-UJaen treebank, the first dependency treebank created for pre-Coptic Egyptian in Universal Dependencies. It describes the current state of the treebank, explains the approach adopted for the morphosyntactic annotation and discusses some issues concerning the adoption of the CoNLL-U format for the annotation of Egyptian texts. This treebank will surely become a useful linguistic tool for understanding the synchronic and diachronic use of pre-Coptic Egyptian.

Symmetric Dependency Structure of Coordination: Crosslinguistic Arguments from Dependency Length Minimization
Adam Przepiórkowski | Magdalena Borysiak | Adam Okrasiński | Bartosz Pobożniak | Wojciech Stempniak | Kamil Tomaszek | Adam Głowacki

The aim of this paper is to replicate and extend recent treebank-based considerations regarding the syntactic structure of coordination. Overall, we confirm the previous results that, given the principle of Dependency Length Minimization, corpus data suggest that the structure of coordination is symmetric. While previous work was based on 2 English datasets, we extend the investigation to 3 more English datasets, 3 Polish datasets, and UD corpora for a number of diverse languages. The results confirm the symmetric structure of coordination, but they also make it possible to question some of the previous findings regarding the exact symmetric structure of coordination.

A First Look at the Ugaritic Poetic Text Corpus
Tillmann Dönicke | Clemens Steinberger | Max-Ferdinand Zeterberg | Noah Krill

For the Ugaritic poetic texts there is currently no digital corpus including extensive philological and poetological annotations. Within the research project “Edition des ugaritischen poetischen Textkorpus” (EUPT), these texts are digitised and provided as an online-accessible corpus. This paper briefly introduces the project and outlines the principles of the data model. The focus is on the different annotation levels and their connection with each other.

LuxBank: The First Universal Dependency Treebank for Luxembourgish
Alistair Plum | Caroline Döhmer | Emilia Milano | Anne-Marie Lutgen | Christoph Purschke

The Universal Dependencies (UD) project has significantly expanded linguistic coverage across 161 languages, yet Luxembourgish, a West Germanic language spoken by approximately 400,000 people, has remained absent until now. In this paper, we introduce LuxBank, the first UD Treebank for Luxembourgish, addressing the gap in syntactic annotation and analysis for this ‘low-research’ language. We establish formal guidelines for Luxembourgish language annotation, providing the foundation for the first large-scale quantitative analysis of its syntax. LuxBank serves not only as a resource for linguists and language learners but also as a tool for developing spell checkers and grammar checkers, organising existing text archives and even training large language models. By incorporating Luxembourgish into the UD framework, we aim to enhance the understanding of syntactic variation within West Germanic languages and offer a model for documenting smaller, semi-standardised languages. This work positions Luxembourgish as a valuable resource in the broader linguistic and NLP communities, contributing to the study of languages with limited research and resources.

Building a Universal Dependencies Treebank for Georgian
Irina Lobzhanidze | Erekle Magradze | Svetlana Berikashvili | Anzor Gozalishvili | Tamar Jalaghonia

This paper presents the design and development of the Georgian Syntactic Treebank within the Universal Dependencies (UD) framework, addressing the unique morphosyntactic challenges of Georgian, a Kartvelian language. We describe the methodology for selecting and annotating 3,013 sentences from Wiki, mapping existing tagsets to the UD scheme, and converting data into the CoNLL-U format. The paper also details the training of a UDPipe model using this preliminary treebank.

Introducing Shallow Syntactic Information within the Graph-based Dependency Parsing
Nikolay Paev | Kiril Simov | Petya Osenova

The paper presents a new BERT model, fine-tuned for parsing of Bulgarian texts. This model is extended with a new neural network layer in order to incorporate shallow syntactic information during the training phase. The results show statistically significant improvement over the baseline. Thus, the addition of syntactic knowledge, even partial, makes the model better. An error analysis has also been conducted on the results from the parsers. Although the architecture has been designed and tested for Bulgarian, it is also scalable to other languages. This scalability is shown here with experiments and evaluation on an English treebank of comparable size.

A Multilingual Parallel Corpus for Coreference Resolution and Information Status in the Literary Domain
Andrew Dyer | Ruveyda Betul Bahceci | Maryam Rajestari | Andreas Rouvalis | Aarushi Singhal | Yuliya Stodolinska | Syahidah Asma Umniyati | Helena Rodrigues Menezes de Oliveira Vaz

Information status, the newness or givenness of referents in discourse, is known to affect the production of language at many different levels. At the morphosyntactic level, information status gives rise to special word orders, elisions, and other phenomena that challenge the notion that morphosyntax can be considered independent of discourse context. Though there are many language-specific corpora annotated for information status and its related phenomena, coreference and anaphora resolution, what is not available at present is a cross-lingually consistently annotated corpus or annotation scheme that would allow for comparative study of these phenomena across many diverse languages. In this paper we present our work to build such a resource. We are annotating a parsed, parallel corpus of prose in many languages for information status and coreference resolution, so that like-for-like cross-lingual comparisons can be made at the intersection of discourse and syntax. Our corpus can and will be used bot

Dependency Structure of Coordination in Head-final Languages: a Dependency-Length-Minimization-Based Study
Wojciech Stempniak

There is no single accepted model of the dependency structure of coordination. Universal Dependencies (UD, De Marneffe et al. 2021) enforces in its corpora an asymmetrical model that privileges the coordination’s first conjunct as a standard. Kanayama et al. (2018) criticize that approach, stating that this model is incompatible with the grammatical structure of head-final languages. Recent research (Przepiórkowski and Woźniak 2023, Przepiórkowski et al. 2024a) provides a DLM-based argument for the symmetrical models of the dependency structure of English coordination. This paper presents the results of an analysis of coordinations found in UD corpora of two head-final languages, namely Korean and Turkish. Based on this analysis and on theoretical arguments, an alternative approach to the dependency structure of coordination in head-final languages is suggested.