Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection

Benjamin C. Warner; Ziqi Xu; Simon Haroutounian; Thomas Kannampallil; Chenyang Lu

doi:10.18653/v1/2025.findings-acl.27

Abstract

Surveys are widely used to collect patient data in healthcare, and there is significant clinical interest in predicting patient outcomes from survey data. However, surveys often include numerous features, which lead to high-dimensional inputs for machine learning models. This paper exploits a unique source of information in surveys for feature selection. We observe that feature names (i.e., survey questions) are often semantically indicative of which features are most useful. Using language models, we leverage semantic textual similarity (STS) scores between features and targets to select features. We evaluate STS scores both for ranking features directly and within the minimal-redundancy-maximal-relevance (mRMR) algorithm, using survey data collected as part of a clinical study on persistent post-surgical pain (PPSP) as well as an accessible dataset collected through the NIH All of Us program. Our findings show that features selected with STS can yield higher-performing models than traditional feature selection algorithms.
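
The two uses of STS mentioned in the abstract (direct ranking and mRMR) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `embed` is a toy bag-of-words stand-in for a language-model embedding, the function names `rank_by_sts` and `mrmr_select` are hypothetical, and the mRMR variant shown is the common greedy "relevance minus mean redundancy" form, with text similarity used for both terms. The paper's exact formulation may differ.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in (assumption) for a language-model embedding: word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_sts(questions, target):
    """Rank feature names by similarity to the target text, highest first."""
    t = embed(target)
    return sorted(questions, key=lambda q: cosine(embed(q), t), reverse=True)


def mrmr_select(questions, target, k):
    """Greedy mRMR sketch: pick the feature maximizing relevance to the
    target minus mean similarity to the features already selected."""
    t = embed(target)
    vecs = {q: embed(q) for q in questions}
    relevance = {q: cosine(vecs[q], t) for q in questions}
    selected, remaining = [], list(questions)

    def score(q):
        if not selected:
            return relevance[q]
        redundancy = sum(cosine(vecs[q], vecs[s]) for s in selected) / len(selected)
        return relevance[q] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Hypothetical survey questions and target, not taken from the paper's data.
    questions = ["pain after surgery", "pain after the surgery",
                 "favorite color", "persistent cough"]
    target = "persistent pain after surgery"
    print(rank_by_sts(questions, target))
    print(mrmr_select(questions, target, 2))
```

In this toy example, direct ranking places the two near-duplicate pain questions first, while the mRMR sketch skips the second of them as redundant. In practice, `embed` and `cosine` would be replaced by similarity scores from an actual language model.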

Anthology ID: 2025.findings-acl.27
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 502–520
URL: https://aclanthology.org/2025.findings-acl.27/
DOI: 10.18653/v1/2025.findings-acl.27
Cite (ACL): Benjamin C. Warner, Ziqi Xu, Simon Haroutounian, Thomas Kannampallil, and Chenyang Lu. 2025. Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection. In Findings of the Association for Computational Linguistics: ACL 2025, pages 502–520, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection (Warner et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.27.pdf
