Multi-Source (Pre-)Training for Cross-Domain Measurement, Unit and Context Extraction

Yueling Li, Sebastian Martschat, Simone Paolo Ponzetto


Abstract
We present a cross-domain approach for automated measurement and context extraction based on pre-trained language models. We construct a multi-source, multi-domain corpus and train an end-to-end extraction pipeline. We then apply multi-source task-adaptive pre-training and fine-tuning to benchmark the cross-domain generalization capability of our model. Further, we conceptualize and apply a task-specific error analysis and derive insights for future work. Our results suggest that multi-source training leads to the best overall results, while single-source training yields the best results on the respective individual domains. While our setup is successful at extracting quantity values and units, more research is needed to improve the extraction of contextual entities. We make the cross-domain corpus used in this work available online.
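The abstract refers to multi-source task-adaptive pre-training, i.e., continuing masked-language-model pre-training on unlabeled text pooled from several source domains before fine-tuning on the labeled extraction task. The following is a minimal sketch of that idea, assuming a BERT-style encoder and the Hugging Face Transformers and Datasets libraries; the model name, file paths, and hyperparameters are illustrative and not taken from the paper.

# Multi-source task-adaptive pre-training (sketch): continue MLM pre-training
# on the concatenation of unlabeled text from several source domains.
# Hypothetical corpus paths and hyperparameters; adapt to the actual data.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One plain-text file per source domain (hypothetical file names).
domain_files = ["domain_a.txt", "domain_b.txt", "domain_c.txt"]
corpus = load_dataset("text", data_files={"train": domain_files})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tapt-multisource",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
# The adapted encoder is then fine-tuned on the labeled measurement,
# unit and context extraction task.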
Anthology ID:
2023.bionlp-1.1
Volume:
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Dina Demner-fushman, Sophia Ananiadou, Kevin Cohen
Venue:
BioNLP
Publisher:
Association for Computational Linguistics
Pages:
1–25
URL:
https://aclanthology.org/2023.bionlp-1.1
DOI:
10.18653/v1/2023.bionlp-1.1
Cite (ACL):
Yueling Li, Sebastian Martschat, and Simone Paolo Ponzetto. 2023. Multi-Source (Pre-)Training for Cross-Domain Measurement, Unit and Context Extraction. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 1–25, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Multi-Source (Pre-)Training for Cross-Domain Measurement, Unit and Context Extraction (Li et al., BioNLP 2023)
PDF:
https://aclanthology.org/2023.bionlp-1.1.pdf