Spoken Language Treebanks in Universal Dependencies: an Overview

Kaja Dobrovoljc

Abstract

Given the benefits of syntactically annotated collections of transcribed speech in spoken language research and applications, many spoken language treebanks have been developed in the last decades, with divergent annotation schemes posing important limitations to cross-resource explorations, such as comparing data across languages, grammatical frameworks, and language domains. As a consequence, there has been a growing number of spoken language treebanks adopting the Universal Dependencies (UD) annotation scheme, aimed at cross-linguistically consistent morphosyntactic annotation. In view of the non-central role of spoken language data within the scheme and with little in-domain consolidation to date, this paper presents a comparative overview of spoken language treebanks in UD to support cross-treebank data explorations on the one hand, and encourage further treebank harmonization on the other. Our results show that the spoken language treebanks differ considerably with respect to the inventory and the format of transcribed phenomena, as well as the principles adopted in their morphosyntactic annotation. This is particularly true for the dependency annotation of speech disfluencies, where conflicting data annotations suggest an underspecification of the guidelines pertaining to speech repairs in general and the reparandum dependency relation in particular.

Anthology ID: 2022.lrec-1.191
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 1798–1806
URL: https://aclanthology.org/2022.lrec-1.191/
Cite (ACL): Kaja Dobrovoljc. 2022. Spoken Language Treebanks in Universal Dependencies: an Overview. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1798–1806, Marseille, France. European Language Resources Association.
Cite (Informal): Spoken Language Treebanks in Universal Dependencies: an Overview (Dobrovoljc, LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.191.pdf
