We Need to Talk About Reproducibility in NLP Model Comparison

Yan Xue, Xuefei Cao, Xingli Yang, Yu Wang, Ruibo Wang, Jihong Li


Abstract
NLPers frequently face reproducibility crisis in a comparison of various models of a real-world NLP task. Many studies have empirically showed that the standard splits tend to produce low reproducible and unreliable conclusions, and they attempted to improve the splits by using more random repetitions. However, the improvement on the reproducibility in a comparison of NLP models is limited attributed to a lack of investigation on the relationship between the reproducibility and the estimator induced by a splitting strategy. In this paper, we formulate the reproducibility in a model comparison into a probabilistic function with regard to a conclusion. Furthermore, we theoretically illustrate that the reproducibility is qualitatively dominated by the signal-to-noise ratio (SNR) of a model performance estimator obtained on a corpus splitting strategy. Specifically, a higher value of the SNR of an estimator probably indicates a better reproducibility. On the basis of the theoretical motivations, we develop a novel mixture estimator of the performance of an NLP model with a regularized corpus splitting strategy based on a blocked 3× 2 cross-validation. We conduct numerical experiments on multiple NLP tasks to show that the proposed estimator achieves a high SNR, and it substantially increases the reproducibility. Therefore, we recommend the NLP practitioners to use the proposed method to compare NLP models instead of the methods based on the widely-used standard splits and the random splits with multiple repetitions.
Anthology ID:
2023.emnlp-main.586
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
9424–9434
Language:
URL:
https://aclanthology.org/2023.emnlp-main.586
DOI:
10.18653/v1/2023.emnlp-main.586
Bibkey:
Cite (ACL):
Yan Xue, Xuefei Cao, Xingli Yang, Yu Wang, Ruibo Wang, and Jihong Li. 2023. We Need to Talk About Reproducibility in NLP Model Comparison. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9424–9434, Singapore. Association for Computational Linguistics.
Cite (Informal):
We Need to Talk About Reproducibility in NLP Model Comparison (Xue et al., EMNLP 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.emnlp-main.586.pdf
Video:
 https://aclanthology.org/2023.emnlp-main.586.mp4