DriftWatch: A Tool that Automatically Detects Data Drift and Extracts Representative Examples Affected by Drift

Myeongjun Jang, Antonios Georgiadis, Yiyun Zhao, Fran Silavong


Abstract
Data drift, which denotes a misalignment between the distribution of reference (i.e., training) and production data, constitutes a significant challenge for AI applications, as it undermines the generalisation capacity of machine learning (ML) models. Therefore, it is imperative to proactively identify data drift before users meet with performance degradation. Moreover, to ensure the successful execution of AI services, endeavours should be directed not only toward detecting the occurrence of drift but also toward effectively addressing this challenge. In this work, we introduce a tool designed to detect data drift in text data. In addition, we propose an unsupervised sampling technique for extracting representative examples from drifted instances. This approach bestows a practical advantage by significantly reducing expenses associated with annotating the labels for drifted instances, an essential prerequisite for retraining the model to sustain its performance on production data.
Anthology ID:
2024.naacl-industry.28
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yi Yang, Aida Davani, Avi Sil, Anoop Kumar
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
335–346
URL:
https://aclanthology.org/2024.naacl-industry.28
DOI:
10.18653/v1/2024.naacl-industry.28
Cite (ACL):
Myeongjun Jang, Antonios Georgiadis, Yiyun Zhao, and Fran Silavong. 2024. DriftWatch: A Tool that Automatically Detects Data Drift and Extracts Representative Examples Affected by Drift. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 335–346, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
DriftWatch: A Tool that Automatically Detects Data Drift and Extracts Representative Examples Affected by Drift (Jang et al., NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-industry.28.pdf