SciConceptMiner: A system for large-scale scientific concept discovery

Scientific knowledge is evolving at an unprecedented rate, with new concepts constantly being introduced in the millions of academic articles published every month. In this paper, we introduce a self-supervised end-to-end system, SciConceptMiner, for the automatic capture of emerging scientific concepts from both independent knowledge sources (semi-structured data) and academic publications (unstructured documents). First, we adopt a BERT-based sequence labeling model to predict candidate concept phrases with self-supervision data. Then, we incorporate rich Web content for synonym detection and concept selection via a web search API. This two-stage approach identifies more than 740K scientific concepts with high accuracy (94.7%). These concepts are deployed in the Microsoft Academic production system and are the backbone for its semantic search capability.


Introduction
Scientific knowledge has expanded at an exponential rate over the past decades, and the fast-growing volume of academic literature accentuates a pressing need for automated capture of fine-grained emerging concepts. Statistical topic models (Blei, 2012), such as latent Dirichlet allocation (LDA) (Blei et al., 2003), have been well recognized for decades as a way to automatically extract the topic structure of large document collections. However, LDA has two main limitations that prevent it from being widely applied to modern large-scale document collections.
First, LDA does not scale well in the number of topics it can model. The latest development (Chen et al., 2018) can process 131M documents with 28B tokens efficiently; however, it extracts only 1,722 topics. With the fast-growing body of scholarly communications, a comprehensive manually controlled vocabulary like Medical Subject Headings (MeSH) (Lowe and Barnett, 1994) contains tens of thousands of subjects (concepts), mostly in the bio-med domain; and an automated scientific knowledge exploration system such as Microsoft Academic Graph (MAG) (Shen et al., 2018) has hundreds of thousands of topics across all academic disciplines. A topic modeling system that is scalable not only to the size of documents but also to the number of topics is imperative.
Footnote 1 (Microsoft Academic): https://academic.microsoft.com/
Second, the result of an LDA model is a list of frequency-based terms that form a topic. It requires manual efforts to annotate such lists to generate a human-readable theme or topic name. An automatic process of identifying topic themes with authoritative names and meaningful descriptions is desired to reduce costly human interventions.
In this paper, we introduce a self-supervised end-to-end system, SciConceptMiner, for automatically discovering scientific concepts from both semi-structured independent knowledge sources and unstructured academic documents. It first obtains a list of concept candidates, either by importing them from external knowledge repositories such as Wikipedia (Völkel et al., 2006; Vrandečić and Krötzsch, 2014) and the Unified Medical Language System (UMLS) (Bodenreider, 2004), or by mining them directly from a collection of academic documents. Such concept lists are large and noisy: they contain millions of entries and are dominated by invalid or duplicate terms. We then send these candidates as queries to a search engine API and leverage rich Web content to identify legitimate concepts, cluster synonyms, and discard improper terms. The search API is also used to retrieve high-quality concept descriptions.
One example is shown in Figure 1. Four out of five trending topics (network embedding, triplet loss, network representation learning, and zero shot learning) under embedding are extracted by our automatic concept extractor model trained on a CS corpus. This demonstrates that our model can effectively capture emerging trending topics from the latest scientific articles.
SciConceptMiner has been deployed to identify concepts from millions of scholarly communications in Microsoft Academic Graph (MAG) (Sinha et al., 2015; Wang et al., 2019, 2020). The MAG with the full list of 740K scientific concepts can be freely accessed via the Microsoft Academic search website and the MAG data set.

System Description
As shown in Figure 2, the SciConceptMiner system has two stages: the first is the concept candidates discovery from various data sources; the second is synonym detection and concept clustering via a Web search API.
In the concept candidates discovery stage, we first integrate the semi-structured independent knowledge sources, Wikipedia and UMLS, into the system. The resulting concept list, together with its associated documents, enables us to train a concept extractor learning model with self-supervision. We design a BERT-based sequence labeling model to make a binary prediction on whether a word or phrase in a sentence is a scientific concept or not. This model is trained on self-supervised data generated by tagging existing concepts (from Wikipedia and UMLS) in a collection of academic documents. We then run inference with the trained model to generate concept candidates for the next stage.
Concept candidates, as the input to the second stage, are either from external knowledge sources or inferred from academic documents. Both sources have high noise ratios, but the noise differs in nature. An independent source such as Wikipedia has high-quality entities (well-defined names and descriptions, rare duplication, and rich links and relationships with each other) but is noisy in entity type (it contains many types of entities other than academic concepts). The UMLS candidates and the candidates inferred from an unstructured corpus contain more irrelevant phrases and concept synonyms. Using a search engine API to retrieve the top N documents with concept candidates as queries, we analyze the returned web pages and the associated URL domain information collectively. This process identifies around 3-5% of candidates from the first stage as proper scientific concepts with consistently high accuracy (94-95% based on sample results) across all data sources, with over 740K concepts in total.

Semi-structured Independent Knowledge Sources
There are many independent knowledge sources, either manually curated, automatically created, or a hybrid of both. Among them, the most notable ones are Wikipedia, WikiData, DBpedia, and Yago in general domains and MeSH and UMLS in the bio-med fields. We have applied Wikipedia and UMLS as sources for the SciConceptMiner system because of their data quality and comprehensive coverage of scientific terms and phrases. Other semi-structured sources can be integrated into the current system design seamlessly as long as their contents pass the quality and relevancy examination.
Wikipedia: Wikipedia is the largest collaboratively edited online encyclopedia. It contains content in more than 300 languages and has over 6 million English articles as of July 2020. It was the first external data source integrated into MAG, given its comprehensive coverage of academic topics spanning from social sciences to natural sciences, as well as technology and applied sciences. Each topic in Wikipedia (as a separate article) is written in high quality and is rarely duplicated (Lewoniewski, 2018). The key challenge of mining quality academic concepts from Wikipedia is to identify the right type of entities, as most articles in Wikipedia are missing entity type information. We used graph link analysis (Milne and Witten, 2008) for type prediction and expanded the concepts from an initial 3K to over 200K. The details are described in the Concept Discovery section of (Shen et al., 2018). For concepts from Wikipedia, we did not use the search engine API for further filtering, as the resulting concept list is already of high quality with rare duplication.
UMLS: UMLS is a repository of biomedical vocabularies developed by the US National Library of Medicine (NLM) with sources from multiple datasets and standards. The latest 2020AA release contains approximately 4.28 million medical concepts and 15.5 million unique concept names from over 200 sources. A system with large, complex data sources typically has various inherent limitations on data quality. For UMLS, these include structural inconsistencies such as cycles in the graph hierarchy, semantic inconsistencies between different vocabularies, and missing hierarchical relationships (Bodenreider, 2004, 2007; Humphreys et al., 1998).
In the concept candidate discovery stage, we take the full list of concept names from UMLS and first clean it with simple rules, such as removing digit-only terms, two-character terms, and overly long terms (over 30 characters). We further filter the remaining terms against a corpus consisting of titles and abstracts from 170 million English scientific articles in MAG, and keep only the terms that appear at least N times in this academic corpus. The resulting list is ready to be sent to a search engine API for duplication detection and concept selection in the second stage.
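The cleaning rules and the frequency filter described above can be sketched as follows. This is a minimal illustration, not the production pipeline: the function names, the case-insensitive whole-phrase matching, and the treatment of terms shorter than two characters are assumptions, and the threshold N is left as a parameter because the text does not state its value.

```python
import re
from collections import Counter


def passes_simple_rules(term: str, max_len: int = 30) -> bool:
    """Apply the simple cleaning rules named in the text."""
    t = term.strip()
    if len(t) <= 2:       # two-character terms (assumption: shorter ones too)
        return False
    if len(t) > max_len:  # overly long terms (over 30 characters)
        return False
    if t.isdigit():       # digit-only terms
        return False
    return True


def count_term_occurrences(terms, documents):
    """Count case-insensitive whole-phrase matches of each term in the corpus."""
    patterns = {
        t: re.compile(r"(?<!\w)" + re.escape(t.lower()) + r"(?!\w)")
        for t in terms
    }
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        for t, pattern in patterns.items():
            counts[t] += len(pattern.findall(text))
    return counts


def filter_candidates(terms, documents, min_count):
    """Keep terms that pass the rules and occur at least min_count times."""
    cleaned = [t for t in terms if passes_simple_rules(t)]
    counts = count_term_occurrences(cleaned, documents)
    return [t for t in cleaned if counts[t] >= min_count]


example_terms = ["12345", "ab", "myocardial infarction", "x" * 31, "gene therapy"]
example_docs = [
    "Myocardial infarction is common. We assess myocardial infarction risk.",
    "A gene therapy trial.",
]
kept_terms = filter_candidates(example_terms, example_docs, min_count=2)
```

Scanning every document once per term is only practical for a toy corpus; at the scale of 170 million articles, an inverted index or a multi-pattern matcher would be the more likely choice.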

Self-supervised Concept Extractor Learning
The volume of new research being published is rapidly increasing, with MAG adding over 1 million new papers every month. This creates a unique challenge to identify, describe, and categorize an ever-evolving set of emerging concepts in a timely fashion.
To tackle this challenge, we formulate concept detection as a self-supervised sequence labeling problem that allows us to extract concept candidates directly from unstructured academic documents. This is motivated by the recent development of deep learning (DL) based Named Entity Recognition (NER) models, which have become dominant and achieve state-of-the-art results (Lample et al., 2016; Chiu and Nichols, 2016; Yadav and Bethard, 2019). NER is the task of identifying named entities of a specific type, such as person or location, in text. A recent survey (Li et al., 2020) proposed a new taxonomy of DL-based NER with three parts: distributed representations for input, context encoder, and tag decoder. We adopt this taxonomy to design our concept extractor learning model.
Instead of a typical NER model, which would learn to identify several entity types at the same time, we reduce our model design to identify a single entity type: the scientific concept. We treat scientific concept extraction as a sequence labeling task. Tokens in the text are labeled with the BIO notation, where 'B', 'I', and 'O' represent the beginning, inside, and outside of a scientific concept chunk respectively. On a sampled set of scientific articles in MAG, we do lexical matching using the synonyms of our existing concepts harvested from Wikipedia and UMLS, and use the matches as self-supervised labels. We fine-tune a transformer-based BERT model (Devlin et al., 2018) (e.g. BERT-Large) as a context encoder and use a Conditional Random Field (CRF) layer as a tag decoder to train a binary classifier on each word in a sentence to detect concept mentions (see footnote 11). Figure 3 illustrates the design of our concept extractor learning model. We infer scientific concept candidates using the trained model on a larger set of high-quality MAG documents, i.e. those published in prestigious journals/conferences. Figure 4 provides some self-supervised concept labeling samples as well as sample sentences with inferred new concepts. These new concept candidates are ready to be used in the next stage.
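To make the self-supervised labeling step concrete, the sketch below tags a tokenized sentence with BIO labels by lexical matching against known concept names, and then reads concept spans back out of a label sequence. It is an illustration under stated assumptions: whitespace tokenization, case-insensitive matching, and the greedy longest-match rule are choices made here, and the helper names `bio_labels` and `extract_spans` are hypothetical. The BERT encoder and CRF decoder themselves are not shown.

```python
def bio_labels(tokens, concept_phrases):
    """Tag tokens with B/I/O via greedy longest match against known concepts."""
    phrases = {tuple(p.lower().split()) for p in concept_phrases}
    max_len = max((len(p) for p in phrases), default=0)
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest possible phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in phrases:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


def extract_spans(tokens, labels):
    """Recover concept phrases from a BIO label sequence."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


sentence = "We study network representation learning and triplet loss .".split()
known = ["network representation learning", "representation learning", "triplet loss"]
labels = bio_labels(sentence, known)
spans = extract_spans(sentence, labels)
```

The same `extract_spans` routine applies to model predictions at inference time: phrases it returns that are not already in the concept list would be the new concept candidates.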

Synonym Detection and Concept Selection
In the second stage, we classify the scientific concept candidates detected in the first stage (either from UMLS or from automatic concept extractor models) into three broad categories: (1) synonyms of existing concepts, (2) new concepts, or (3) low-quality words/phrases that we discard. This is accomplished by searching for each concept candidate using the Bing Web Search API and clustering candidates into scientific concept "identities" based on the URL relevance/reputation and the consistency of the mentions among top search results.
Footnote 11: We re-use the BERT vocabularies and their pre-trained embeddings without regenerating or retraining them on the academic corpus.
More specifically, if at least K of the top N URLs returned for two concept candidates are the same, we consider the two candidates to be synonyms of one concept. We also curate an allowed-list and a block-list of URL domains. Concept candidates whose top search results come from well-known domains of high-quality academic knowledge (in the allowed-list) are accepted; otherwise, they are rejected. The block-list is used to reject terms that have results from block-listed domains even though they also have results from domains in the allowed-list. That is usually the case for common words and phrases, whose results include pages from online dictionary domains.
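The URL-overlap rule for synonym detection can be sketched with a union-find structure. This is a minimal illustration: the input format, the function name `cluster_synonyms`, and the transitive merging of pairs into clusters are assumptions, and the values of K and N are not given in the text. The URLs are placeholders, not real search results.

```python
from itertools import combinations


def cluster_synonyms(top_urls, k):
    """Group candidates whose top-N result lists share at least k URLs.

    top_urls maps each candidate string to its list of top-N result URLs.
    Returns a list of sets, one per synonym cluster.
    """
    parent = {c: c for c in top_urls}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    url_sets = {c: set(urls) for c, urls in top_urls.items()}
    for a, b in combinations(top_urls, 2):
        if len(url_sets[a] & url_sets[b]) >= k:
            union(a, b)

    clusters = {}
    for c in top_urls:
        clusters.setdefault(find(c), set()).add(c)
    return list(clusters.values())


search_results = {
    "CNN": ["https://a.example/1", "https://a.example/2", "https://a.example/3"],
    "convolutional neural network": [
        "https://a.example/1", "https://a.example/2", "https://a.example/4",
    ],
    "triplet loss": ["https://b.example/5", "https://b.example/6", "https://b.example/7"],
}
clusters = cluster_synonyms(search_results, k=2)
```

The pairwise comparison is quadratic in the number of candidates; with millions of candidates, grouping by shared URL through an inverted index would avoid comparing unrelated pairs.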
This simple yet effective approach trims around 92%-97% of concept candidates as noisy terms and keeps 3%-7% as high-quality concepts and synonyms, together with well-written descriptions taken from allowed-list domains that contain credible academic knowledge.
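The allowed-list and block-list check can be sketched as below. The domain names are hypothetical placeholders, and the acceptance rule used here (at least one allowed-list domain among the top results and no block-listed domain) is an assumption, since the text does not say how many top results must come from allowed domains.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def select_concept(urls, allowed, blocked) -> bool:
    """Accept a candidate based on the domains of its top search results."""
    domains = {domain_of(u) for u in urls}
    if domains & blocked:            # e.g. online dictionary pages
        return False
    return bool(domains & allowed)   # credible academic knowledge sources


ALLOWED = {"encyclopedia.example.org"}   # hypothetical allowed-list
BLOCKED = {"dictionary.example.com"}     # hypothetical block-list

accepted = select_concept(
    ["https://encyclopedia.example.org/wiki/Triplet_loss"], ALLOWED, BLOCKED
)
common_word = select_concept(
    [
        "https://encyclopedia.example.org/wiki/Network",
        "https://www.dictionary.example.com/define/network",
    ],
    ALLOWED,
    BLOCKED,
)
unknown_source = select_concept(["https://blog.example.net/post"], ALLOWED, BLOCKED)
```

Checking the block-list first mirrors the description above: a common word can have an encyclopedia page and still be rejected because dictionary pages appear among its results.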

Self-supervised concept extractor learning
We use the BERT-Large-Cased as the pre-trained language model and fine-tune the described con-
To ensure that this model works for documents across various scientific domains, we conduct experiments training our model using documents in different top domains (e.g. computer science and medicine). We observe that higher-quality candidates are generated using models trained on a corpus from the same domain. For example, when we apply the model trained with a CS corpus to predict concepts in the medicine corpus, the F1 score drops from 0.942 to 0.682. Therefore, we train a separate model on the corpus of each top-level domain, and the F1 scores of inference results on in-domain and out-of-domain corpora are shown in Table 2.
We have only conducted model training and inference on the CS and medicine corpora. Continued training on corpora from other disciplines, as well as exploring more effective concept extractor learning models, are among our ongoing efforts.

Concept Analysis Based on Data Sources
In this section, we conduct an evaluation of the concept quality in terms of accuracy and coverage. We estimate the coverage by evaluating potential missed opportunities on discarded terms. We also leverage MAG data to conduct the analysis of top domain distribution and topic age distribution conditioned on different data sources.
The statistics in this section are collected on four groups of concepts by their data sources: Wikipedia, UMLS, and automatically extracted concepts on the Computer Science (AutoCS or A-CS) and Medicine (AutoMed or A-Med) corpora respectively. Since the concepts discovered in SciConceptMiner are already integrated into MAG, we use the paper-concept relationship, concept hierarchy, and paper metadata such as publication year in MAG to facilitate this analysis. The details on how to obtain these relationships and metadata are out of the scope of this work; please refer to (Wang et al., 2019; Shen et al., 2018) for more information.

Size, Impact, and Accuracy
In Table 3, we report the number of concepts, the average number of papers associated with a concept, the average number of citations received by a paper tagged with a concept, and the accuracy of concepts. The independent knowledge sources (Wikipedia and UMLS) provide similar numbers of topics, on a scale of hundreds of thousands, while the automatic extraction models identify about one-tenth as many as the external sources. On average, the concepts from Wikipedia are broader (with more papers associated) and have a higher impact (with more citations received), while concepts from UMLS are more fine-grained with slightly smaller influence. We evaluate the accuracy with the same approach described in (Shen et al., 2018), and it is at a similar level, between 94% and 95%, across all data sources.

Potential Opportunities on Discarded Contents
It is generally challenging to evaluate the coverage of such a large-scale concept discovery system since it is nearly impossible to identify the "ground truth" of full coverage, even in a narrowed subdomain. In order to estimate the coverage, we identify the potential opportunities that we may have missed by sampling and inspecting the discarded inferred terms from learned concept extractor models. We sample 300 discarded terms in AutoCS and AutoMed respectively and report the size and accuracy (see footnote 13) in

Topic Age Distribution
In Table 6, we report the average age of the papers associated with a concept. The average publication year (rounded down), as well as the 5%, 50% (the median), and 95% publication year of a concept, are also reported. It shows that concepts from UMLS are generally discovered and used in earlier years and last longer (25 years for the middle 90%), while AutoCS and AutoMed contain newer concepts with a shorter life span (17-18 years for the middle 90%).
Footnote 13: We split the sampled data of each category into 3 groups of 100 each, and they are evaluated by 3 judges. We report the average of positive label ratios.
Table 6 (columns: Source, Age, Avg Y, 5% Y, 50% Y, 95% Y).
Figure 5 provides a yearly distribution from 2010 to 2019. It represents the percentage of papers (associated with concepts in respective sources) over the past 10 years. This is consistent with our expectation as one of our primary goals of leveraging the automatic concept extraction is to discover emerging concepts in the latest scientific documents.
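The age statistics reported in Table 6 can be computed from the publication years of the papers tagged with a concept, as in the following sketch. The definitions are assumptions made for illustration: age is taken as a reference year minus the mean publication year, percentiles use the nearest-rank method, and the "middle 90%" life span is the difference between the 95% and 5% years. The text does not state which definitions were actually used.

```python
import math


def nearest_rank_percentile(sorted_values, q):
    """Nearest-rank percentile for an integer q in 1..100 (assumed method)."""
    n = len(sorted_values)
    rank = max(1, -(-q * n // 100))  # integer ceiling of q * n / 100
    return sorted_values[rank - 1]


def concept_age_stats(years, reference_year):
    """Summarize the publication years of the papers tagged with one concept."""
    ys = sorted(years)
    mean_year = sum(ys) / len(ys)
    p5, p50, p95 = (nearest_rank_percentile(ys, q) for q in (5, 50, 95))
    return {
        "avg_age": reference_year - mean_year,
        "avg_year": math.floor(mean_year),  # rounded down, as in the text
        "p5_year": p5,
        "p50_year": p50,
        "p95_year": p95,
        "middle_90_span": p95 - p5,
    }


# Hypothetical concept with one paper per year from 2000 to 2019.
stats = concept_age_stats(list(range(2000, 2020)), reference_year=2020)
```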

Conclusion
In this work, we demonstrated a large-scale scientific concept discovery production system, SciConceptMiner, for automatically capturing academic concepts from both semi-structured data and unstructured documents. The system has two parts: the first is concept candidate identification, and the second is synonym detection and concept selection. We used a BERT-based sequence labeling model to learn concept phrases with self-supervision and leveraged a Web search API to cluster synonyms and identify valid concepts.
SciConceptMiner has discovered more than 740K scientific concepts across all research domains from Wikipedia, UMLS, and scholarly articles with high accuracy (94.7%). These concepts are integrated to build the Microsoft Academic Graph, which publishes one of the largest cross-domain scientific taxonomies. It enables easy exploration of scientific knowledge and facilitates many downstream applications such as information retrieval, question answering, and recommendations.