Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set

Katharina Kann, Kyunghyun Cho, Samuel R. Bowman


Abstract
Development sets are impractical to obtain for real low-resource languages, since using all available data for training is often more effective. However, development sets are widely used in research papers that purport to deal with low-resource natural language processing (NLP). Here, we aim to answer the following questions: Does using a development set for early stopping in the low-resource setting influence results as compared to a more realistic alternative, where the number of training epochs is tuned on development languages? And does it lead to overestimation or underestimation of performance? We repeat multiple experiments from recent work on neural models for low-resource NLP and compare results for models obtained by training with and without development sets. On average over languages, absolute accuracy differs by up to 1.4%. However, for some languages and tasks, differences are as big as 18.0% accuracy. Our results highlight the importance of realistic experimental setups in the publication of low-resource NLP research results.
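As a rough illustration of the two setups the abstract contrasts, the following minimal sketch compares choosing the stopping epoch from a target-language development set with fixing the epoch count in advance from development languages. The function names, the toy accuracy curves, and the use of the lower median to aggregate across development languages are all assumptions made for illustration; the abstract does not specify how the epoch count is aggregated.

```python
from statistics import median_low


def early_stopping_epoch(dev_accuracy_per_epoch):
    """Return the 1-based epoch with the highest development-set accuracy.

    This mimics early stopping with a target-language development set.
    Ties resolve to the earliest epoch.
    """
    best_index = max(
        range(len(dev_accuracy_per_epoch)),
        key=lambda i: dev_accuracy_per_epoch[i],
    )
    return best_index + 1


def epochs_from_development_languages(curves_by_language):
    """Choose one fixed epoch count using other (development) languages only.

    Assumption: the per-language best epochs are aggregated with the
    lower median; any other aggregation could be substituted here.
    """
    best_epochs = [
        early_stopping_epoch(curve) for curve in curves_by_language.values()
    ]
    return median_low(best_epochs)


# Hypothetical accuracy curves for three development languages.
dev_language_curves = {
    "lang_a": [0.40, 0.55, 0.60, 0.58],
    "lang_b": [0.30, 0.50, 0.49, 0.48],
    "lang_c": [0.20, 0.35, 0.45, 0.50, 0.49],
}

# Fixed number of epochs to use for the target language, with no
# target-language development set required.
fixed_epochs = epochs_from_development_languages(dev_language_curves)
```

In the second setup, all target-language data can go into training, because the stopping point is decided before the target language is seen.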
Anthology ID:
D19-1329
Volume:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Kentaro Inui, Jing Jiang, Vincent Ng, Xiaojun Wan
Venues:
EMNLP | IJCNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
3342–3349
URL:
https://aclanthology.org/D19-1329
DOI:
10.18653/v1/D19-1329
Cite (ACL):
Katharina Kann, Kyunghyun Cho, and Samuel R. Bowman. 2019. Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3342–3349, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set (Kann et al., EMNLP-IJCNLP 2019)
PDF:
https://aclanthology.org/D19-1329.pdf
Attachment:
 D19-1329.Attachment.zip