What Clued the AI Doctor In? On the Influence of Data Source and Quality for Transformer-Based Medical Self-Disclosure Detection

Mina Valizadeh; Xing Qian; Pardis Ranjbar-Noiey; Cornelia Caragea; Natalie Parde

doi:10.18653/v1/2023.eacl-main.86

Abstract

Recognizing medical self-disclosure is important in many healthcare contexts, but it has been under-explored by the NLP community. We conduct a three-pronged investigation of this task. We (1) manually expand and refine the only existing medical self-disclosure corpus, resulting in a new, publicly available dataset of 3,919 social media posts with clinically validated labels and high compatibility with the existing task-specific protocol. We also (2) study the merits of pretraining task domain and text style by comparing Transformer-based models for this task, pretrained from general, medical, and social media sources. Our BERTweet condition outperforms the existing state of the art for this task by a relative F1 score increase of 16.73%. Finally, we (3) compare data augmentation techniques for this task, to assess the extent to which medical self-disclosure data may be further synthetically expanded. We discover that this task poses many challenges for data augmentation techniques, and we provide an in-depth analysis of identified trends.

Anthology ID: 2023.eacl-main.86
Volume: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Andreas Vlachos, Isabelle Augenstein
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 1201–1216
URL: https://aclanthology.org/2023.eacl-main.86
DOI: 10.18653/v1/2023.eacl-main.86
Cite (ACL): Mina Valizadeh, Xing Qian, Pardis Ranjbar-Noiey, Cornelia Caragea, and Natalie Parde. 2023. What Clued the AI Doctor In? On the Influence of Data Source and Quality for Transformer-Based Medical Self-Disclosure Detection. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1201–1216, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): What Clued the AI Doctor In? On the Influence of Data Source and Quality for Transformer-Based Medical Self-Disclosure Detection (Valizadeh et al., EACL 2023)
PDF: https://aclanthology.org/2023.eacl-main.86.pdf
Dataset: 2023.eacl-main.86.dataset.zip
Video: https://aclanthology.org/2023.eacl-main.86.mp4
