2021
Measuring the relative importance of full text sections for information retrieval from scientific literature.
Lana Yeganova
|
Won Gyu Kim
|
Donald Comeau
|
W John Wilbur
|
Zhiyong Lu
Proceedings of the 20th Workshop on Biomedical Language Processing
With the growing availability of full-text articles, integrating abstracts and full texts of documents into a unified representation is essential for comprehensive search of scientific literature. However, previous studies have shown that naïvely merging abstracts with full texts of articles does not consistently yield better performance. Balancing the contribution of query terms appearing in the abstract and in sections of different importance in full-text articles remains a challenge both for traditional bag-of-words IR approaches and for neural retrieval methods. In this work we establish the connection between the BM25 score of a query term appearing in a section of a full-text document and the probability of that document being clicked or identified as relevant. The probability is computed using Pool Adjacent Violators (PAV), an isotonic regression algorithm, which provides a maximum likelihood estimate based on the observed data. Using this probabilistic transformation of BM25 scores we show improved performance on the PubMed Click dataset developed and presented in this study, as well as on the 2007 TREC Genomics collection.
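As an illustration of the PAV step mentioned in the abstract above, the following is a minimal sketch of isotonic regression that maps a retrieval score to an estimated probability of relevance. It is not the authors' implementation: the function names `pav_fit` and `pav_predict`, the binary click labels, and the toy scores are all assumptions made for the example.

```python
from bisect import bisect_left


def pav_fit(scores, labels):
    """Pool Adjacent Violators: fit a non-decreasing step function.

    scores: numeric values (for example BM25 scores; hypothetical input).
    labels: 0/1 outcomes (for example clicked or not; hypothetical input).
    Returns a list of (upper_score, probability) pairs, one per pooled block.
    """
    blocks = []  # each block is [label_sum, count, upper_score]
    for score, label in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == score:
            # Tied scores always share a block.
            blocks[-1][0] += label
            blocks[-1][1] += 1
        else:
            blocks.append([float(label), 1, score])
        # Pool adjacent blocks while the earlier block has a higher mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            last = blocks.pop()
            blocks[-1][0] += last[0]
            blocks[-1][1] += last[1]
            blocks[-1][2] = last[2]
    return [(upper, total / count) for total, count, upper in blocks]


def pav_predict(model, score):
    """Look up the probability of the block that covers the given score."""
    uppers = [upper for upper, _ in model]
    index = min(bisect_left(uppers, score), len(model) - 1)
    return model[index][1]


# Toy data: five scores with made-up click labels.
model = pav_fit([1, 2, 3, 4, 5], [0, 1, 0, 1, 1])
```

In this sketch the scores 2 and 3 are pooled into one block because their labels (1 then 0) violate monotonicity, so both receive the pooled estimate of 0.5. Scores above the largest observed value fall back to the last block.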
2018
SingleCite: Towards an improved Single Citation Search in PubMed
Lana Yeganova
|
Donald C Comeau
|
Won Kim
|
W John Wilbur
|
Zhiyong Lu
Proceedings of the BioNLP 2018 workshop
A search that is targeted at finding a specific document in a database is called a Single Citation search. Single citation searches are particularly important for scholarly databases, such as PubMed, because users are frequently searching for a specific publication. In this work we describe SingleCite, a single citation matching system designed to facilitate a user's search for a specific document. We report on the progress that has been achieved towards building that functionality.
MeSH-based dataset for measuring the relevance of text retrieval
Won Gyu Kim
|
Lana Yeganova
|
Donald Comeau
|
W John Wilbur
|
Zhiyong Lu
Proceedings of the BioNLP 2018 workshop
Creating simulated search environments has been of significant interest in information retrieval, in both general and biomedical search domains. Existing collections include a modest number of queries and are constructed by manually evaluating retrieval results. In this work we propose leveraging MeSH term assignments for creating synthetic test beds. We select a suitable subset of MeSH terms as queries, and utilize MeSH term assignments as pseudo-relevance rankings for retrieval evaluation. Using well-studied retrieval functions, we show that their performance on the proposed data is consistent with similar findings in previous work. We further use the proposed retrieval evaluation framework to better understand how to combine heterogeneous sources of textual information.
2017
BioCreative VI Precision Medicine Track: creating a training corpus for mining protein-protein interactions affected by mutations
Rezarta Islamaj Doğan
|
Andrew Chatr-aryamontri
|
Sun Kim
|
Chih-Hsuan Wei
|
Yifan Peng
|
Donald Comeau
|
Zhiyong Lu
BioNLP 2017
The Precision Medicine Track in BioCreative VI aims to bring together the BioNLP community for a novel challenge focused on mining the biomedical literature in search of mutations and protein-protein interactions (PPI). In order to support this track with an effective training dataset with limited curator time, the track organizers carefully reviewed PubMed articles from two different sources: curated public PPI databases, and the results of state-of-the-art public text mining tools. We detail here the data collection, manual review and annotation process and describe the characteristics of this training corpus. We also describe a corpus performance baseline. This analysis will provide useful information to developers and researchers for comparing and developing innovative text mining approaches for the BioCreative VI challenge and other Precision Medicine-related applications.
2016
PubTermVariants: biomedical term variants and their use for PubMed search
Lana Yeganova
|
Won Kim
|
Sun Kim
|
Rezarta Islamaj Doğan
|
Wanli Liu
|
Donald C Comeau
|
Zhiyong Lu
|
W John Wilbur
Proceedings of the 15th Workshop on Biomedical Natural Language Processing
2013
Generalizing an Approximate Subgraph Matching-based System to Extract Events in Molecular Biology and Cancer Genetics
Haibin Liu
|
Karin Verspoor
|
Donald C. Comeau
|
Andrew MacKinlay
|
W. John Wilbur
Proceedings of the BioNLP Shared Task 2013 Workshop
BioNLP Shared Task 2013: Supporting Resources
Pontus Stenetorp
|
Wiktoria Golik
|
Thierry Hamon
|
Donald C. Comeau
|
Rezarta Islamaj Doğan
|
Haibin Liu
|
W. John Wilbur
Proceedings of the BioNLP Shared Task 2013 Workshop
2012
Classifying Gene Sentences in Biomedical Literature by Combining High-Precision Gene Identifiers
Sun Kim
|
Won Kim
|
Don Comeau
|
W. John Wilbur
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
2011
Text Mining Techniques for Leveraging Positively Labeled Data
Lana Yeganova
|
Donald C. Comeau
|
Won Kim
|
W. John Wilbur
Proceedings of BioNLP 2011 Workshop