Measuring Robustness for NLP

Yu Yu, Abdul Rafae Khan, Jia Xu


Abstract
The quality of Natural Language Processing (NLP) models is typically measured by the accuracy or error rate on a predefined test set. Because the evaluation and optimization of these measures are narrowed to a specific domain, such as news, and do not generalize to other domains, such as Twitter, we often observe that a system reported to achieve human parity produces surprising errors in real-life use scenarios. We address this weakness with a new approach that measures NLP quality in terms of robustness. Unlike previous work that defines robustness via Minimax to bound worst cases, we measure robustness as the consistency of cross-domain accuracy and introduce the coefficient of variation and (epsilon, gamma)-Robustness. Our measures show higher agreement with human evaluation than accuracy scores such as BLEU when ranking Machine Translation (MT) systems. Our experiments on sentiment analysis and MT tasks show that incorporating our robustness measures into learning objectives significantly improves final NLP prediction accuracy across domains such as biomedical and social media.
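The abstract's first proposed measure, the coefficient of variation, is a standard statistic (population standard deviation divided by the mean). A minimal sketch of how it could summarize cross-domain consistency follows; the per-domain accuracies and system names are hypothetical, and the paper's full (epsilon, gamma)-Robustness definition is not reproduced here:

```python
import statistics

def coefficient_of_variation(accuracies):
    """CV = population std dev / mean of per-domain accuracies.

    A lower CV suggests more consistent performance across domains.
    """
    return statistics.pstdev(accuracies) / statistics.mean(accuracies)

# Hypothetical per-domain accuracies (e.g. news, Twitter, biomedical).
system_a = [0.92, 0.88, 0.90]   # consistent across domains
system_b = [0.97, 0.70, 0.85]   # higher peak, but erratic

print(coefficient_of_variation(system_a))
print(coefficient_of_variation(system_b))
```

Under this view, system_b's strong news-domain score would not redeem its instability: its larger CV flags it as less robust than system_a despite a comparable mean.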
Anthology ID:
2022.coling-1.343
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3908–3916
URL:
https://aclanthology.org/2022.coling-1.343
Cite (ACL):
Yu Yu, Abdul Rafae Khan, and Jia Xu. 2022. Measuring Robustness for NLP. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3908–3916, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Measuring Robustness for NLP (Yu et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.343.pdf