Cochrane-auto: An Aligned Dataset for the Simplification of Biomedical Abstracts

Jan Bakker, Jaap Kamps


Abstract
The most reliable and up-to-date information on health questions is in the biomedical literature, but it is inaccessible because of its complex, jargon-heavy language. Domain-specific scientific text simplification holds the promise of making this literature accessible to a lay audience. We therefore create Cochrane-auto: a large corpus of pairs of aligned sentences, paragraphs, and abstracts from biomedical abstracts and lay summaries. Experiments demonstrate that a plan-guided simplification system trained on Cochrane-auto outperforms a strong baseline trained on unaligned abstracts and lay summaries. More generally, our freely available corpus complements Newsela-auto and Wiki-auto and facilitates text simplification research beyond the sentence level and beyond direct lexical and grammatical revisions.
Anthology ID:
2024.tsar-1.5
Volume:
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Matthew Shardlow, Horacio Saggion, Fernando Alva-Manchego, Marcos Zampieri, Kai North, Sanja Štajner, Regina Stodden
Venue:
TSAR
Publisher:
Association for Computational Linguistics
Pages:
41–51
URL:
https://aclanthology.org/2024.tsar-1.5
Cite (ACL):
Jan Bakker and Jaap Kamps. 2024. Cochrane-auto: An Aligned Dataset for the Simplification of Biomedical Abstracts. In Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pages 41–51, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Cochrane-auto: An Aligned Dataset for the Simplification of Biomedical Abstracts (Bakker & Kamps, TSAR 2024)
PDF:
https://aclanthology.org/2024.tsar-1.5.pdf