Zero-Shot Data Maps. Efficient Dataset Cartography Without Model Training

Angelo Basile, Marc Franco-Salvador, Paolo Rosso


Abstract
Data Maps (Swayamdipta et al., 2020) have emerged as a powerful tool for diagnosing large annotated datasets. Given a model fitted on a dataset, these maps show each data instance from the dataset in a 2-dimensional space defined by a) the model’s confidence in the true class and b) the variability of this confidence. In previous work, confidence and variability are usually computed using training dynamics, which requires fitting a strong model to the dataset. In this paper, we introduce a novel approach: Zero-Shot Data Maps based on fast bi-encoder networks. For each data point, confidence on the true label and variability are computed over the members of an ensemble of zero-shot models constructed with different, but semantically equivalent, label descriptions, i.e., textual representations of each class in a given label space. We conduct a comparative analysis of maps compiled using traditional training dynamics and our proposed zero-shot models across various datasets. Our findings reveal that Zero-Shot Data Maps generally match those produced by the traditional method while delivering up to a 14x speedup. The code is available at https://github.com/symanto-research/zeroshot-cartography.
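To make the abstract's description concrete, the following is a minimal sketch of how confidence and variability could be computed over an ensemble of zero-shot bi-encoder models. It is not the authors' implementation (see the linked repository for that). It assumes that texts and label descriptions have already been embedded by some bi-encoder, that class probabilities are obtained with a softmax over cosine similarities, and that confidence and variability are the mean and standard deviation of the true-class probability across ensemble members, by analogy with the training-dynamics definition. The function names, the `temperature` parameter, and the array layout are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def zero_shot_data_map(text_emb, label_emb, gold, temperature=1.0):
    """Sketch of zero-shot data map coordinates (assumed formulation).

    text_emb:  (n, d) array, one embedding per data instance.
    label_emb: (m, c, d) array, m ensemble members (one per set of
               label descriptions), c classes each.
    gold:      (n,) integer array of true class indices.
    temperature: hypothetical scaling of similarities before softmax.

    Returns (confidence, variability), each of shape (n,).
    """
    t = l2_normalize(text_emb)
    l = l2_normalize(label_emb)
    # Cosine similarity of every text to every class, per member: (m, n, c)
    sims = np.einsum("nd,mcd->mnc", t, l)
    probs = softmax(sims / temperature, axis=-1)
    # Probability assigned to the true class by each member: (m, n)
    gold_probs = probs[:, np.arange(len(gold)), gold]
    # Mean and (population) standard deviation across ensemble members
    confidence = gold_probs.mean(axis=0)
    variability = gold_probs.std(axis=0)
    return confidence, variability


if __name__ == "__main__":
    # Random placeholder embeddings stand in for real bi-encoder outputs.
    rng = np.random.default_rng(0)
    texts = rng.normal(size=(5, 8))       # 5 instances, 8-dim embeddings
    labels = rng.normal(size=(3, 2, 8))   # 3 label-description sets, 2 classes
    gold = np.array([0, 1, 0, 1, 1])
    conf, var = zero_shot_data_map(texts, labels, gold)
    print(conf.shape, var.shape)
```

Under these assumptions, instances with high confidence and low variability would fall in the "easy" region of the map, while low-confidence instances are candidates for closer inspection, with no model training involved.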
Anthology ID:
2023.findings-emnlp.554
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8264–8277
URL:
https://aclanthology.org/2023.findings-emnlp.554
DOI:
10.18653/v1/2023.findings-emnlp.554
Cite (ACL):
Angelo Basile, Marc Franco-Salvador, and Paolo Rosso. 2023. Zero-Shot Data Maps. Efficient Dataset Cartography Without Model Training. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8264–8277, Singapore. Association for Computational Linguistics.
Cite (Informal):
Zero-Shot Data Maps. Efficient Dataset Cartography Without Model Training (Basile et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.554.pdf