A Corpus of Scientific Biomedical Texts Spanning over 168 Years Annotated for Uncertainty

Ramona Bongelli, Carla Canestrari, Ilaria Riccioni, Andrzej Zuczkowski, Cinzia Buldorini, Ricardo Pietrobon, Alberto Lavelli, Bernardo Magnini


Abstract
Uncertainty language permeates biomedical research and is fundamental for the computer interpretation of unstructured text. Yet a coherent, cognitively based theory to interpret Uncertainty language and guide Natural Language Processing is, to our knowledge, nonexistent. The aim of our project was therefore to detect and annotate Uncertainty markers, which play a significant role in building knowledge or beliefs in readers' minds, in a biomedical research corpus. Our corpus includes 80 manually annotated articles from the British Medical Journal randomly sampled from a 168-year period. Uncertainty markers were classified according to a theoretical framework based on a combined linguistic and cognitive theory, and the corpus was manually annotated according to these principles. We performed preliminary experiments to assess the manually annotated corpus and establish a baseline for the automatic detection of Uncertainty markers. The results of the experiments show that most of the Uncertainty markers can be recognized with good accuracy.
Anthology ID:
L12-1489
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2009–2014
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/823_Paper.pdf
Cite (ACL):
Ramona Bongelli, Carla Canestrari, Ilaria Riccioni, Andrzej Zuczkowski, Cinzia Buldorini, Ricardo Pietrobon, Alberto Lavelli, and Bernardo Magnini. 2012. A Corpus of Scientific Biomedical Texts Spanning over 168 Years Annotated for Uncertainty. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2009–2014, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
A Corpus of Scientific Biomedical Texts Spanning over 168 Years Annotated for Uncertainty (Bongelli et al., LREC 2012)