The Fragility of Multi-Treebank Parsing Evaluation

Iago Alonso-Alonso, David Vilares, Carlos Gómez-Rodríguez


Abstract
Treebank selection for parsing evaluation and the spurious effects that might arise from a biased choice have not been explored in detail. This paper studies how evaluating on a single subset of treebanks can lead to weak conclusions. First, we take a few contrasting parsers and run them on subsets of treebanks proposed in previous work, whose use was justified (or not) on criteria such as typology or data scarcity. Second, we run a large-scale version of this experiment: we create a large number of random subsets of treebanks and compare on them many parsers whose scores are available. The results show substantial variability across subsets and that, although establishing guidelines for good treebank selection is hard, some inadequate strategies can be easily avoided.
Anthology ID:
2022.coling-1.475
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
5345–5359
URL:
https://aclanthology.org/2022.coling-1.475
Cite (ACL):
Iago Alonso-Alonso, David Vilares, and Carlos Gómez-Rodríguez. 2022. The Fragility of Multi-Treebank Parsing Evaluation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5345–5359, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
The Fragility of Multi-Treebank Parsing Evaluation (Alonso-Alonso et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.475.pdf
Code
 minionattack/fragility_coling_2022