Learning from Unlabelled Data for Clinical Semantic Textual Similarity

Yuxia Wang, Karin Verspoor, Timothy Baldwin


Abstract
Domain pretraining followed by task fine-tuning has become the standard paradigm for NLP tasks, but requires in-domain labelled data for task fine-tuning. To overcome this, we propose to utilise domain unlabelled data by assigning pseudo labels from a general model. We evaluate the approach on two clinical STS datasets, and achieve r = 0.80 on N2C2-STS. Further investigation reveals that if the data distribution of unlabelled sentence pairs is closer to the test data, we can obtain better performance. By leveraging a large general-purpose STS dataset and small-scale in-domain training data, we obtain further improvements to r = 0.90, a new SOTA.
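The pseudo-labelling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which general model they use, so `general_sts_score` below is a hypothetical toy stand-in (token overlap scaled to an assumed 0 to 5 similarity range), and the sentence pairs are invented for illustration. `pearson_r` computes the Pearson correlation, the r reported in the abstract.

```python
from math import sqrt


def general_sts_score(a: str, b: str) -> float:
    """Toy stand-in for a general-domain STS model (assumption, not the paper's model).

    Returns Jaccard token overlap scaled to an assumed 0-5 similarity range.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 5.0 * len(ta & tb) / len(ta | tb)


def pseudo_label(pairs):
    """Attach a pseudo similarity label to each unlabelled sentence pair."""
    return [(a, b, general_sts_score(a, b)) for a, b in pairs]


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("Pearson r is undefined for constant inputs")
    return cov / (sx * sy)


# Hypothetical unlabelled in-domain sentence pairs.
unlabelled = [
    ("patient denies chest pain", "patient denies chest pain"),
    ("take one tablet daily", "take two tablets daily"),
    ("no known drug allergies", "follow up in two weeks"),
]

# The pseudo-labelled pairs would then serve as fine-tuning data
# for an in-domain model (the fine-tuning step is omitted here).
pseudo = pseudo_label(unlabelled)
```

In a real setting, the stand-in scorer would be replaced by a model trained on general-purpose STS data, and `pearson_r` would compare the fine-tuned model's predictions against gold scores on the clinical test set.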
Anthology ID:
2020.clinicalnlp-1.25
Volume:
Proceedings of the 3rd Clinical Natural Language Processing Workshop
Month:
November
Year:
2020
Address:
Online
Editors:
Anna Rumshisky, Kirk Roberts, Steven Bethard, Tristan Naumann
Venue:
ClinicalNLP
Publisher:
Association for Computational Linguistics
Pages:
227–233
URL:
https://aclanthology.org/2020.clinicalnlp-1.25
DOI:
10.18653/v1/2020.clinicalnlp-1.25
Cite (ACL):
Yuxia Wang, Karin Verspoor, and Timothy Baldwin. 2020. Learning from Unlabelled Data for Clinical Semantic Textual Similarity. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 227–233, Online. Association for Computational Linguistics.
Cite (Informal):
Learning from Unlabelled Data for Clinical Semantic Textual Similarity (Wang et al., ClinicalNLP 2020)
PDF:
https://aclanthology.org/2020.clinicalnlp-1.25.pdf
Video:
https://slideslive.com/38939835