Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text

Avijit Mitra, Zhichao Yang, Emily Druhl, Raelene Goodwin, Hong Yu


Abstract
Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available, good-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts with no Synth-SBDH training, achieving up to 63.75% macro-F improvements. Additionally, Synth-SBDH proves effective for rare SBDH categories and under resource constraints while being substantially cheaper than expert-annotated real-world data. Human evaluation reveals a 71.06% Human-LLM alignment and uncovers areas for future refinements.
Anthology ID:
2025.emnlp-main.1418
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
27887–27923
URL:
https://aclanthology.org/2025.emnlp-main.1418/
Cite (ACL):
Avijit Mitra, Zhichao Yang, Emily Druhl, Raelene Goodwin, and Hong Yu. 2025. Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27887–27923, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text (Mitra et al., EMNLP 2025)
PDF:
https://aclanthology.org/2025.emnlp-main.1418.pdf
Checklist:
 2025.emnlp-main.1418.checklist.pdf