Modest von Korff


Exhaustive Indexing of PubMed Records with Medical Subject Headings
Modest von Korff
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)

With fourteen million publication records, the PubMed database is one of the largest repositories in medical science. Analysing this database to relate biological targets to diseases is an important task in pharmaceutical research. We developed a software tool, MeSHTreeIndexer, for indexing the PubMed medical literature with disease terms. The disease terms were taken from the Medical Subject Headings (MeSH) compiled by the US National Institutes of Health (NIH). In a first, semi-automatic step we identified about 5,900 terms as disease-related. The MeSH terms contain so-called entry points, which are used as synonyms for the terms. We created an inverted index for these 5,900 MeSH terms and their 58,000 entry points. Fourteen million publication records from the PubMed database were stored in Lucene and then tagged using the inverted MeSH term index. In this contribution we demonstrate that our approach provides a significantly higher enrichment in MeSH terms than the indexing of the PubMed records by the NIH itself. Manual inspection confirmed that the enrichment is meaningful. Our software was written in Java and is available as open source.
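The tagging step described in the abstract can be illustrated with a minimal sketch. It is not the MeSHTreeIndexer code: the class MeshTaggerSketch, its methods, and the two sample headings with their entry points are hypothetical and chosen only for illustration, and a plain in-memory map stands in for the Lucene-backed store mentioned above. The sketch assumes that a record is tagged with a MeSH term whenever the term itself or one of its entry points occurs as a contiguous, case-insensitive token sequence in the record text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: maps MeSH headings and their entry points back to the heading. */
final class MeshTaggerSketch {

    // Normalised phrase (lower-case tokens joined by one space) -> MeSH heading.
    private final Map<String, String> phraseToHeading = new HashMap<>();
    // Longest registered phrase in tokens; bounds the lookup window in tag().
    private int maxPhraseLength = 1;

    /** Registers a heading together with its entry points (synonyms). */
    void addHeading(String heading, String... entryPoints) {
        register(heading, heading);
        for (String entryPoint : entryPoints) {
            register(entryPoint, heading);
        }
    }

    private void register(String phrase, String heading) {
        List<String> tokens = tokenize(phrase);
        if (tokens.isEmpty()) {
            return;
        }
        phraseToHeading.put(String.join(" ", tokens), heading);
        maxPhraseLength = Math.max(maxPhraseLength, tokens.size());
    }

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Returns the headings whose name or entry point occurs in the record text. */
    Set<String> tag(String recordText) {
        List<String> tokens = tokenize(recordText);
        Set<String> headings = new TreeSet<>();
        for (int start = 0; start < tokens.size(); start++) {
            int maxEnd = Math.min(tokens.size(), start + maxPhraseLength);
            for (int end = start + 1; end <= maxEnd; end++) {
                String phrase = String.join(" ", tokens.subList(start, end));
                String heading = phraseToHeading.get(phrase);
                if (heading != null) {
                    headings.add(heading);
                }
            }
        }
        return headings;
    }

    public static void main(String[] args) {
        MeshTaggerSketch tagger = new MeshTaggerSketch();
        // Sample headings and entry points, for illustration only.
        tagger.addHeading("Diabetes Mellitus, Type 2", "Type 2 Diabetes", "NIDDM");
        tagger.addHeading("Myocardial Infarction", "Heart Attack");
        System.out.println(tagger.tag("Patients with NIDDM have a higher risk of heart attack."));
    }
}
```

Mapping every entry point to its heading means that a record mentioning only a synonym still receives the canonical MeSH term, which is one plausible reason the approach can assign more terms per record than curated indexing alone. A production system would also need to handle ambiguous entry points and word-form variation, which this sketch ignores.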