Extending the SSJ Universal Dependencies Treebank for Slovenian: Was It Worth It?

Kaja Dobrovoljc; Nikola Ljubešić

Abstract

This paper presents the creation and evaluation of a new version of the reference SSJ Universal Dependencies Treebank for Slovenian, which has been substantially improved and extended to almost double its original size. The process began with a revision and documentation of the language-specific UD annotation guidelines for Slovenian and the corresponding modification of the original SSJ annotations. This was followed by a two-stage annotation campaign in which two new subsets were added: the previously unreleased sentences from the ssj500k corpus and the Slovenian subset of the ELEXIS parallel corpus. The annotation campaign resulted in an extended version of the SSJ UD treebank with 5,435 newly added sentences comprising 126,427 tokens. To evaluate the potential benefits of this data increase for Slovenian dependency parsing, we compared the performance of the classla-stanza dependency parser trained on the old and the new SSJ data when evaluated on the new SSJ test set and its subsets. Our results show a general increase in LAS, especially for previously under-represented syntactic phenomena such as lists, elliptical constructions and appositions, but also confirm the distinct nature of the two newly added subsets and the diversification of the SSJ treebank as a whole.
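For readers unfamiliar with the metric, the labeled attachment score (LAS) mentioned above is the share of tokens whose predicted head and dependency relation both match the gold annotation. The following is only a minimal sketch, not the evaluation code used in the paper: the function name `las` and the toy data are hypothetical, and it assumes gold tokenization and a comparison of full relation labels.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency relation label).
Arc = Tuple[int, str]


def las(gold: List[Arc], pred: List[Arc]) -> float:
    """Fraction of tokens whose head AND relation both match the gold arc.

    Assumes both lists describe the same tokens in the same order.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have equal length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


# Toy three-token sentence (hypothetical data): the third token has the
# right head but the wrong relation label, so it does not count as correct.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
score = las(gold, pred)
```

Here two of the three tokens are fully correct, so the score is 2/3. Shared-task evaluation scripts typically also handle differences in tokenization, which this sketch leaves out.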

Anthology ID: 2022.law-1.3
Volume: Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022
Month: June
Year: 2022
Address: Marseille, France
Editors: Sameer Pradhan, Sandra Kuebler
Venue: LAW
SIG: SIGANN
Publisher: European Language Resources Association
Pages: 15–22
URL: https://aclanthology.org/2022.law-1.3/
Cite (ACL): Kaja Dobrovoljc and Nikola Ljubešić. 2022. Extending the SSJ Universal Dependencies Treebank for Slovenian: Was It Worth It? In Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022, pages 15–22, Marseille, France. European Language Resources Association.
Cite (Informal): Extending the SSJ Universal Dependencies Treebank for Slovenian: Was It Worth It? (Dobrovoljc & Ljubešić, LAW 2022)
PDF: https://aclanthology.org/2022.law-1.3.pdf