Investigating UD Treebanks via Dataset Difficulty Measures

Artur Kulmizev, Joakim Nivre


Abstract
Treebanks annotated with Universal Dependencies (UD) are currently available for over 100 languages and are widely utilized by the community. However, their inherent characteristics are hard to measure and are only partially reflected in parser evaluations via accuracy metrics like LAS. In this study, we analyze a large subset of the UD treebanks using three recently proposed accuracy-free dataset analysis methods: dataset cartography, 𝒱-information, and minimum description length. Each method provides insights about UD treebanks that would remain undetected if only LAS were considered. Specifically, we identify a number of treebanks that, despite yielding high LAS, contain very little information that a parser can use to surpass what simple heuristics achieve. Furthermore, we note several treebanks that score consistently low across numerous metrics, indicating a high degree of noise or annotation inconsistency.
Anthology ID:
2023.eacl-main.76
Volume:
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
1076–1089
URL:
https://aclanthology.org/2023.eacl-main.76
DOI:
10.18653/v1/2023.eacl-main.76
Cite (ACL):
Artur Kulmizev and Joakim Nivre. 2023. Investigating UD Treebanks via Dataset Difficulty Measures. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1076–1089, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Investigating UD Treebanks via Dataset Difficulty Measures (Kulmizev & Nivre, EACL 2023)
PDF:
https://aclanthology.org/2023.eacl-main.76.pdf
Video:
https://aclanthology.org/2023.eacl-main.76.mp4