The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing

Rotem Dror, Gili Baumer, Segev Shlomov, Roi Reichart


Abstract
Statistical significance testing is a standard statistical tool designed to ensure that experimental results are not coincidental. In this opinion/theoretical paper we discuss the role of statistical significance testing in Natural Language Processing (NLP) research. We establish the fundamental concepts of significance testing and discuss the specific aspects of NLP tasks, experimental setups and evaluation measures that affect the choice of significance tests in NLP research. Based on this discussion we propose a simple practical protocol for statistical significance test selection in NLP setups and accompany this protocol with a brief survey of the most relevant tests. We then survey recent empirical papers published in ACL and TACL during 2017 and show that while our community assigns great value to experimental results, statistical significance testing is often ignored or misused. We conclude with a brief discussion of open issues that should be properly addressed so that this important tool can be applied in NLP research in a statistically sound manner.
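To make the abstract's subject concrete, here is a minimal sketch of one significance test commonly used to compare two NLP systems on a shared test set: the paired bootstrap test. This is an illustrative example, not the paper's protocol; the function name, the one-sided null hypothesis, and the percentile-style p-value estimate are assumptions of this sketch.

```python
import random

def paired_bootstrap_test(scores_a, scores_b, num_samples=10000, seed=0):
    """Paired bootstrap significance test for comparing two systems.

    scores_a, scores_b: per-example evaluation scores of systems A and B
    on the SAME test examples (pairing is what makes the test "paired").
    Returns an estimated one-sided p-value for the null hypothesis that
    system A does not outperform system B.
    """
    assert len(scores_a) == len(scores_b), "scores must be paired"
    n = len(scores_a)
    rng = random.Random(seed)
    # Per-example score differences; resampling these preserves pairing.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    count = 0
    for _ in range(num_samples):
        # Resample n examples with replacement and recompute the mean gap.
        delta = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if delta <= 0:  # A failed to beat B on this resample
            count += 1
    return count / num_samples
```

Usage: if `p = paired_bootstrap_test(acc_a, acc_b)` falls below the chosen significance level (e.g. 0.05), the observed advantage of system A is unlikely to be a coincidence of the particular test set sample.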
Anthology ID:
P18-1128
Volume:
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2018
Address:
Melbourne, Australia
Venue:
ACL
Association for Computational Linguistics
Pages:
1383–1392
URL:
https://aclanthology.org/P18-1128
DOI:
10.18653/v1/P18-1128
PDF:
https://aclanthology.org/P18-1128.pdf
Video:
https://vimeo.com/285803636
Presentation:
P18-1128.Presentation.pdf
Code:
rtmdrr/testSignificanceNLP