A Universal Dependencies Conversion Pipeline for a Penn-format Constituency Treebank

Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Kristín Bjarnadóttir, Anton Karl Ingason, Hildur Jónsdóttir, Steinþór Steingrímsson


Abstract
This paper presents a rule-based pipeline for converting constituency treebanks in the Penn Treebank format to Universal Dependencies (UD). We describe an Icelandic constituency treebank, its annotation scheme, and the UD scheme. We then discuss the conversion, the methods used to produce a UD corpus fully automatically, and the complications involved. To show that the pipeline applies to corpora in other languages, we extend it and convert a Faroese constituency treebank to a UD corpus. The result is an open-source conversion tool, published under an Apache 2.0 license, that converts a Penn-style treebank to a UD corpus, along with the two new UD corpora.
Anthology ID:
2020.udw-1.3
Volume:
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Marie-Catherine de Marneffe, Miryam de Lhoneux, Joakim Nivre, Sebastian Schuster
Venue:
UDW
Publisher:
Association for Computational Linguistics
Pages:
16–25
URL:
https://aclanthology.org/2020.udw-1.3
Cite (ACL):
Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Kristín Bjarnadóttir, Anton Karl Ingason, Hildur Jónsdóttir, and Steinþór Steingrímsson. 2020. A Universal Dependencies Conversion Pipeline for a Penn-format Constituency Treebank. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 16–25, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal):
A Universal Dependencies Conversion Pipeline for a Penn-format Constituency Treebank (Arnardóttir et al., UDW 2020)
PDF:
https://aclanthology.org/2020.udw-1.3.pdf
Data
Universal Dependencies