LotusOrchid at #SMM4H–HeaRD 2026: Fitting pretrained encoders for Dutch medical data

Sophie Arnoult; Shutao Chen; Piek Vossen

Abstract

This paper presents our submission to MultiClinAI’s NER subtask at #SMM4H-HeaRD 2026. We address two questions: (1) which language model best represents the clinical notes, and (2) which annotations help in training these models. To answer them, we follow a token-based classification approach with pretrained encoder language models, comparing models pretrained on generic data against models pretrained on medical data, and models pretrained on a single language, Dutch, against multilingual models. In addition, we present two data-augmented systems: one trained multilingually with data from the other languages of the workshop, and one trained with synthetic annotations.

Anthology ID: 2026.smm4h-1.23
Volume: Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks
Month: July
Year: 2026
Address: San Diego, United States
Editors: Guillermo Lopez-Garcia, Graciela Gonzalez-Hernandez
Venues: SMM4H | WS
Publisher: Association for Computational Linguistics
Pages: 139–145
URL: https://aclanthology.org/2026.smm4h-1.23/
Cite (ACL): Sophie Arnoult, Shutao Chen, and Piek Vossen. 2026. LotusOrchid at #SMM4H–HeaRD 2026: Fitting pretrained encoders for Dutch medical data. In Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks, pages 139–145, San Diego, United States. Association for Computational Linguistics.
Cite (Informal): LotusOrchid at #SMM4H–HeaRD 2026: Fitting pretrained encoders for Dutch medical data (Arnoult et al., SMM4H 2026)
PDF: https://aclanthology.org/2026.smm4h-1.23.pdf