A Silver Standard Corpus of Human Phenotype-Gene Relations

Diana Sousa, Andre Lamurias, Francisco M. Couto


Abstract
Human phenotype-gene relations are fundamental to fully understanding the origin of some phenotypic abnormalities and their associated diseases. Biomedical literature is the most comprehensive source of these relations; however, we need Relation Extraction tools to recognize them automatically. Most of these tools require an annotated corpus, and to the best of our knowledge, there is no corpus available annotated with human phenotype-gene relations. This paper presents the Phenotype-Gene Relations (PGR) corpus, a silver standard corpus of human phenotype and gene annotations and their relations. The corpus consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 relations. We generated this corpus using Named-Entity Recognition tools, whose results were partially evaluated by eight curators, obtaining a precision of 87.01%. Using the corpus, we obtained promising results with two state-of-the-art deep learning tools, namely a precision of 78.05%. The PGR corpus was made publicly available to the research community.
Anthology ID:
N19-1152
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Jill Burstein, Christy Doran, Thamar Solorio
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
1487–1492
URL:
https://aclanthology.org/N19-1152/
DOI:
10.18653/v1/N19-1152
Cite (ACL):
Diana Sousa, Andre Lamurias, and Francisco M. Couto. 2019. A Silver Standard Corpus of Human Phenotype-Gene Relations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1487–1492, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
A Silver Standard Corpus of Human Phenotype-Gene Relations (Sousa et al., NAACL 2019)
PDF:
https://aclanthology.org/N19-1152.pdf
Poster:
N19-1152.Poster.pdf
Code:
lasigeBioTM/PGR
Data:
PGR