EHR-SeqSQL : A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records

Jaehee Ryu, Seonhee Cho, Gyubok Lee, Edward Choi


Abstract
In this paper, we introduce EHR-SeqSQL, a novel sequential text-to-SQL dataset for Electronic Health Record (EHR) databases. EHR-SeqSQL is designed to address critical yet underexplored aspects of text-to-SQL parsing: interactivity, compositionality, and efficiency. To the best of our knowledge, EHR-SeqSQL is not only the largest medical text-to-SQL benchmark but also the first to include sequential and contextual questions. We provide a data split and a new test set designed to assess compositional generalization ability. Our experiments demonstrate that a multi-turn approach is superior to a single-turn approach in learning compositionality. Additionally, our dataset integrates specially crafted tokens into SQL queries to improve execution efficiency. With EHR-SeqSQL, we aim to bridge the gap between practical needs and academic research in the text-to-SQL domain.
Anthology ID:
2024.findings-acl.971
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16388–16407
URL:
https://aclanthology.org/2024.findings-acl.971
Cite (ACL):
Jaehee Ryu, Seonhee Cho, Gyubok Lee, and Edward Choi. 2024. EHR-SeqSQL : A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records. In Findings of the Association for Computational Linguistics ACL 2024, pages 16388–16407, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
EHR-SeqSQL : A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records (Ryu et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.971.pdf