Yaozong Shen
2023
IAEval: A Comprehensive Evaluation of Instance Attribution on Natural Language Understanding
Peijian Gu | Yaozong Shen | Lijie Wang | Quan Wang | Hua Wu | Zhendong Mao
Findings of the Association for Computational Linguistics: EMNLP 2023
Instance attribution (IA) aims to identify the training instances that lead to the prediction of a test example, helping researchers better understand the dataset and optimize data processing. While many IA methods have been proposed recently, how to evaluate them remains an open question. Previous evaluations of IA focus on only one or two dimensions and are not comprehensive. In this work, we introduce IAEval, a systematic and comprehensive evaluation scheme for IA methods covering four significant requirements: sufficiency, completeness, stability and plausibility. We design novel metrics to measure these requirements for the first time. Three representative IA methods are evaluated under IAEval on four natural language understanding datasets. Extensive experiments confirm the effectiveness of IAEval and demonstrate its ability to provide a comprehensive comparison among IA methods. With IAEval, researchers can choose the most suitable IA methods for applications such as model debugging.
2022
A Fine-grained Interpretability Evaluation Benchmark for Neural NLP
Lijie Wang | Yaozong Shen | Shuyuan Peng | Shuai Zhang | Xinyan Xiao | Hao Liu | Hongxuan Tang | Ying Chen | Hua Wu | Haifeng Wang
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)
While there is increasing concern about the interpretability of neural models, the evaluation of interpretability remains an open problem due to the lack of proper evaluation datasets and metrics. In this paper, we present a novel benchmark to evaluate the interpretability of both neural models and saliency methods. This benchmark covers three representative NLP tasks: sentiment analysis, textual similarity and reading comprehension, each provided with both English and Chinese annotated data. To evaluate interpretability precisely, we provide token-level rationales that are carefully annotated to be sufficient, compact and comprehensive. We also design a new metric, i.e., the consistency between the rationales before and after perturbations, to uniformly evaluate interpretability across different types of tasks. Based on this benchmark, we conduct experiments on three typical models with three saliency methods, and unveil their strengths and weaknesses in terms of interpretability. We will release this benchmark (https://www.luge.ai/#/luge/task/taskDetail?taskId=15) and hope it can facilitate research on building trustworthy systems.