Estimating Semantic Similarity between In-Domain and Out-of-Domain Samples

Rhitabrat Pokharel, Ameeta Agrawal


Abstract
Prior work typically describes out-of-domain (OOD) or out-of-distribution (OODist) samples as those that originate from dataset(s) or source(s) different from the training set but for the same task. Compared to in-domain (ID) samples, models are known to usually perform worse on OOD samples, although this observation is not consistent. Another thread of research has focused on OOD detection, albeit mostly using supervised approaches. In this work, we first consolidate and present a systematic analysis of multiple definitions of OOD and OODist as discussed in prior literature. Then, we analyze the performance of a model under ID and OOD/OODist settings in a principled way. Finally, we seek to identify an unsupervised method for reliably identifying OOD/OODist samples without using a trained model. The results of our extensive evaluation using 12 datasets from 4 different tasks suggest that unsupervised metrics hold promise for this task.
Anthology ID:
2023.starsem-1.35
Volume:
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Alexis Palmer, Jose Camacho-Collados
Venue:
*SEM
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
409–416
URL:
https://aclanthology.org/2023.starsem-1.35
DOI:
10.18653/v1/2023.starsem-1.35
Cite (ACL):
Rhitabrat Pokharel and Ameeta Agrawal. 2023. Estimating Semantic Similarity between In-Domain and Out-of-Domain Samples. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 409–416, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Estimating Semantic Similarity between In-Domain and Out-of-Domain Samples (Pokharel & Agrawal, *SEM 2023)
PDF:
https://aclanthology.org/2023.starsem-1.35.pdf