Deep Dominance - How to Properly Compare Deep Neural Models

Rotem Dror, Segev Shlomov, Roi Reichart


Abstract
Comparing Deep Neural Network (DNN) models based on their performance on unseen data is crucial for the progress of the NLP field. However, these models have a large number of hyper-parameters and, being non-convex, their convergence point depends on the random values chosen at initialization and during training. Proper DNN comparison hence requires a comparison between their empirical score distributions on unseen data, rather than between single evaluation scores as is standard for simpler, convex models. In this paper, we propose to adapt to this problem a recently proposed test for the Almost Stochastic Dominance relation between two distributions. We define the criteria for a high-quality comparison method between DNNs, and show, both theoretically and through analysis of extensive experimental results with leading DNN models for sequence tagging tasks, that the proposed test meets all criteria while previously proposed methods fail to do so. We hope the test we propose here will set a new working practice in the NLP community.
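The comparison the abstract describes operates on two empirical score distributions (e.g., dev-set F1 scores across random seeds) rather than two single numbers. Below is a minimal, illustrative sketch of the empirical violation ratio at the core of the almost stochastic dominance test: the share of the squared quantile gap where model A's scores fall below model B's (0 means A stochastically dominates B; values near 0 indicate almost stochastic dominance). The function name and grid size are assumptions for this sketch; the full test in the paper additionally applies a bootstrap-based confidence correction, which is omitted here.

```python
import numpy as np

def violation_ratio(scores_a, scores_b, grid=1000):
    """Empirical violation ratio between two score samples.

    Compares the empirical quantile functions of A and B on an
    evenly spaced grid. Returns the fraction of the total squared
    quantile difference contributed by points where A's quantile
    is below B's (the "violation set" of A dominating B).
    """
    ts = (np.arange(grid) + 0.5) / grid          # quantile levels in (0, 1)
    qa = np.quantile(scores_a, ts)               # empirical quantiles of A
    qb = np.quantile(scores_b, ts)               # empirical quantiles of B
    diff = qb - qa                               # positive where A is violated
    sq = diff ** 2
    total = sq.sum()
    if total == 0:                               # identical samples: no violation
        return 0.0
    return float(sq[diff > 0].sum() / total)
```

A ratio of 0 means A's scores dominate B's at every quantile; a ratio of 1 means the opposite; intermediate values quantify how close A is to stochastic dominance over B.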
Anthology ID:
P19-1266
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
2773–2785
URL:
https://aclanthology.org/P19-1266
DOI:
10.18653/v1/P19-1266
Cite (ACL):
Rotem Dror, Segev Shlomov, and Roi Reichart. 2019. Deep Dominance - How to Properly Compare Deep Neural Models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2773–2785, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Deep Dominance - How to Properly Compare Deep Neural Models (Dror et al., ACL 2019)
PDF:
https://aclanthology.org/P19-1266.pdf
Video:
https://vimeo.com/384738204
Code:
rtmdrr/deepComparison