Improved Topic Representations of Medical Documents to Assist COVID-19 Literature Exploration

Efficient discovery and exploration of biomedical literature has grown in importance in the context of the COVID-19 pandemic, and topic-based methods such as latent Dirichlet allocation (LDA) are a useful tool for this purpose. In this study we compare traditional topic models based on word tokens with topic models based on medical concepts, and propose several ways to improve topic coherence and specificity.


Introduction
As the Coronavirus 2019 (COVID-19) pandemic has presented unprecedented challenges to wellbeing and safety, the medical community has responded by rapidly conducting and publishing a vast amount of related research. This in turn has made it difficult for medical professionals and researchers to keep abreast of the latest evidence. Combined with the need to explore legacy literature on related coronaviruses such as SARS or MERS, there is a need for tools supporting efficient knowledge discovery and exploratory search that go beyond simple text retrieval (Marchionini, 2006). In this context, representing and visualising the content of documents to allow the user to quickly identify relevant studies is becoming critically important.
Latent Dirichlet allocation (LDA: Blei et al. (2003)) is probably the most commonly used method for topic-based analysis of documents. It was applied by many systems in a recent Kaggle challenge over coronavirus literature (kaggle.com/allen-institute-for-ai/CORD-19-research-challenge), and is used in search and exploration tools recently developed for COVID-19 research (discovid.ai, strategicfutures.org/TopicMaps/COVID-19/), including our own COVID-SEE system (covid-see.com; Verspoor et al., 2020). However, while conventional LDA models work well for topically-diverse document collections, they are less informative in narrow, knowledge-rich domains such as medicine, especially when the corpus consists of documents related to one broad topic, as in the coronavirus-related literature. An ideal topic model should capture more specific, discriminating topics rather than generic topics made up of terms occurring in the majority of documents.
In this paper, we compare topic models based on medical concepts with traditional models based on words, and examine the nature of the inferred topics in terms of genericness and coherence.

Related work
Topic modelling has been applied in the biomedical domain to cluster documents (Zhao et al., 2014), improve document retrieval (Yu et al., 2016), and discover biological relationships (Wang et al., 2011) or similar drugs (Bisgin et al., 2012). In practice, however, LDA is most commonly used to discover salient topics in a document collection (see, for example, Wang et al. (2016)), including for topical representation of COVID-19 related literature (Le Bras et al., 2020; Verspoor et al., 2020). Several attempts have been made to improve biomedical topic models by extracting a controlled set of biomedical entities (Wang et al., 2011) or using MeSH headers (Doshi-Velez et al., 2014); our approach differs in that we do not use a set of known relevant terms but rather filter out non-informative words to allow for more unrestricted knowledge discovery. In terms of topic quality, AlSumait et al. (2009) introduced the notion of "junk" (incoherent) and "background" (generic) topics, which are uniformly distributed over words and documents, respectively. However, although their method allows topics to be ranked by usefulness and quality, the authors do not experiment with improving their specificity and coherence.

Dataset
For our experiments we use the CORD-19 dataset, which is currently the most extensive coronavirus literature corpus publicly available (Wang et al., 2020). The dataset includes COVID-19 and coronavirus-related publications from various sources, such as the PubMed Central open access corpus, research articles from a corpus maintained by the WHO, and bioRxiv and medRxiv pre-prints.
For our dataset we use the abstracts of the papers in the corpus, or the first two paragraphs of the full text if no abstract is available. We remove documents in languages other than English using the CLD2 library. The resulting dataset consists of 103,955 documents with an average length of 156 words.
Topic modelling

LDA model
Latent Dirichlet allocation represents each document as a mixture of topics, and each topic as a mixture of words (Blei et al., 2003). We use an asymmetric prior for the document-topic distributions, as it has been shown to improve the robustness of the model (Wallach et al., 2009) and the coherence of topics learned from abstracts of scientific articles (Syed and Spruit, 2018). Following Blei and Lafferty (2006), we filter out tokens or concepts which occur in fewer than 20 documents or in more than 50% of the dataset, and remove stopwords based on PubMed's list. We trained models for 5 epochs, noting that convergence based on perplexity usually occurred by the fifth epoch. Preliminary experimentation identified that the optimal number of topics for the dataset, based on the Cv topic coherence measure (Röder et al., 2015), varied depending on the input representation (Fig. 1). To allow more direct comparison, we fix the number of topics at 25 for each model, as the coherence scores are close to each other at this point.
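The vocabulary filtering step described above can be sketched in plain Python; `filter_vocabulary` is a hypothetical helper, not the authors' code. (In practice, gensim's `Dictionary.filter_extremes(no_below=20, no_above=0.5)` implements the same document-frequency filter.)

```python
from collections import Counter

def filter_vocabulary(docs, min_df=20, max_df_ratio=0.5, stopwords=frozenset()):
    """Keep terms occurring in at least min_df documents and in at most
    max_df_ratio of the collection; docs is a list of token (or concept) lists."""
    n_docs = len(docs)
    # Document frequency: number of documents each term appears in at least once.
    df = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio and t not in stopwords}
    return [[t for t in doc if t in keep] for doc in docs]
```

The filtered documents can then be fed to any bag-of-words LDA implementation.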

Document representation
We consider three different input representations of the text for inferring the models:
• Word tokens: The input text was tokenised using the NLTK Tokeniser.
• UMLS concepts: Medical concepts identified in the text and mapped to UMLS concept identifiers.
• Non-generic concepts: The UMLS concept-based representation of the texts, with more general concepts filtered out.
The choice of representation based on medical (UMLS) concepts is motivated as follows: (1) it avoids splitting multi-word concepts (such as degenerative disease of the central nervous system) into less meaningful units (of, the, central, etc.); (2) it ensures that disambiguated homonyms, such as cats (C0007450, mammal) and cats (C1825121, gene), can be assigned to different topics by the model; and (3) it maps different lexicogrammatical variations of a given term into a single concept, thus reducing noise in the data and highlighting important keywords. For example, concept C0000731 occurs in the articles as abdominal distension, abdominal distention, bloating, distended abdomens, swelling of abdomen, etc., which would not be captured by typical approaches to text normalisation such as lemmatisation, stemming, or n-gram overlap. We use a bag of concept identifiers (such as [C4038448 C1314792 C1443924 C0042963 C0392760 C2948600...]) to train the model, and then represent each identifier in the results by the lexicalisation that occurs most frequently in the document collection, as distinct from the MetaMap "preferred term", which is often a technical description rather than its lexical form (e.g. we use colon instead of the preferred term Colon structure (body structure)).
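The relabelling step (choosing each concept's most frequent surface form instead of the MetaMap preferred term) could look like the following sketch; the function name and input shape are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

def most_frequent_lexicalisations(mentions):
    """mentions: iterable of (concept_id, surface_form) pairs collected from
    the corpus. Returns a mapping from each concept ID to its most common
    surface form, used to label topic terms in the results."""
    forms = defaultdict(Counter)
    for cui, surface in mentions:
        forms[cui][surface.lower()] += 1
    # most_common(1) gives the single highest-count surface form per concept.
    return {cui: counts.most_common(1)[0][0] for cui, counts in forms.items()}
```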
We also attempt to filter out generic, or broad, concepts. Scientific publications contain many non-specific terms, which can be part of their discourse structure (boilerplate sentences, section headings such as Discussion, phrases such as in conclusion), or be included in informative sentences without being meaningful for the purposes of topic modelling. As adding all such words to a stop-word list would not be feasible, we filter the concepts based on their semantic type as defined in UMLS. Following ShafieiBavani et al. (2016) and Plaza et al. (2011), who used a similar approach to filter concepts for graph-based summarisation of medical documents, we exclude terms based on broad semantic types such as QUANTITATIVE CONCEPT (rate, unit). We additionally exclude the following four semantic types: CONCEPTUAL ENTITY (example, step), ACTIVITY (contribute, activation), RESEARCH ACTIVITY (validate, research), and OCCUPATIONAL ACTIVITY (production, administration).
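The semantic-type filter reduces to a set-membership check once each concept's UMLS semantic type is known. A minimal sketch, assuming a precomputed concept-to-type mapping (in practice this comes from MetaMap/UMLS output); the helper name and the exact type-name strings are illustrative:

```python
# Semantic types excluded in this sketch, per the description above.
EXCLUDED_TYPES = {
    "Quantitative Concept", "Conceptual Entity", "Activity",
    "Research Activity", "Occupational Activity",
}

def filter_generic_concepts(doc_concepts, semantic_types, excluded=EXCLUDED_TYPES):
    """Drop concepts whose UMLS semantic type is in the excluded set.
    doc_concepts: concept IDs for one document;
    semantic_types: dict mapping concept ID -> semantic type name."""
    return [c for c in doc_concepts if semantic_types.get(c) not in excluded]
```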

Evaluation
As automatic coherence measures cannot evaluate the quality of topics in terms of how useful or representative they are, we performed human evaluation. Two annotators (one of the authors and a medical professional) judged whether a topic, represented by its 5 most frequent tokens or concepts, was coherent or not; coherent topics were further subdivided into specific and generic. This distinction is important as some topics can be highly coherent but not informative. This is especially visible in datasets where the documents are homogeneous both in terms of style (scientific articles) and content (related to coronaviruses). For example, topics such as [research, study, approach] or [coronavirus, virus, disease] are coherent, but not representative of the content of a paper. In line with this, each topic was assigned one of three labels by the annotators: incoherent, specific, or generic. Following Newman et al. (2010), to evaluate coherence, annotators were asked to decide if each topic was meaningful and interpretable. To judge specificity, they were instructed to decide whether a particular set of words was specific to a subtopic of the corpus rather than generic across it.

Experiments
Table 1 shows the number of incoherent, generic and specific topics learned by each of the models. In general, using concepts improved topic coherence, while removing generic concepts helped to make topics even more specific.

Model analysis
Baseline model As can be seen in Table 2, the basic (word) token-based model suffers from multiple issues, some of which can be solved by expanding the stopword list (against, or), lemmatisation (methods/method, strains/strain), or removing non-alphabetic tokens ([1], 95%); but some, such as the splitting of multi-word terms (middle, east; respiratory, tract), are an unavoidable result of tokenisation. Because of such splitting, the less specific but more frequent parts of multi-word terms (e.g. syndrome in acute respiratory syndrome) are more likely to be generated by the model, and as a result both the topics and the terms within them are more generic.

Concept-based model
The topics based on UMLS concepts are shown in Table 3. The concept-based representation helps to improve coherence, and also to produce granular topics with more specific terms, such as domain, peptide, residues, fusion, epitopes. However, some issues remain, such as generic topics and non-informative terms (e.g., associated with) inside specific topics.
Model based on non-generic concepts Topics learned after filtering out broad UMLS concepts are shown in Table 4. In addition to the overall improvement in terms of specific topics, it can be noted that some of the topics generated by this model are surprisingly granular and coherent, such as AMINO ACIDS AND THEIR NAMES (d, m, amino acid sequence, f, amino acid), ANTI-HIV DRUGS FOR CANCER TREATMENT (hiv, drug, development, cancer, inhibitors), or PREGNANCY AND BIRTH (neonatal, deliver, delivery, pregnancy, birth).

Drilling into topics
Unfortunately, an LDA model cannot learn a large number of highly-specific topics, as there is a trade-off between the number of topics and their coherence. To achieve higher granularity, after training the model we subdivide the dataset based on the most prevalent topic in each document, and then train an LDA model on each subset. We experiment with two approaches here: the first model is the non-generic concept model as described in Section 4.1 above, while in the second, non-generic concepts are re-weighted based on their log-likelihood. We treat each of the articles in the subset as a target corpus and the remainder of its documents as a background corpus, and compare concept distributions using the log-likelihood test (Rayson and Garside, 2000). This highlights concepts that differentiate a particular document from others discussing the same broad topic, even if they share the same set of frequent terms.
After assigning log-likelihood weights to the concepts in each document, we sort them by weight and use the top 50% to represent the document for topic modelling, thus discarding less salient terms. Table 5 shows the top-5 topics learned by these models from a subset corresponding to the topic PROTEIN BINDING (mediated, binding, cell, pathway, receptor). It can be seen that while the count-based topics describe general aspects of protein binding and virus replication, the topics based on the log-likelihood test refer to specific proteins and viruses. Both of these models can be useful for medical researchers, allowing them to switch between a general view of major themes in a document collection and highly-specific topics to assist drug and treatment discovery.
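The re-weighting scheme above can be sketched with the standard Rayson and Garside (2000) log-likelihood statistic; the function names are hypothetical, and the "top 50%" cut is applied here to distinct concepts per document, which is one plausible reading of the description.

```python
import math
from collections import Counter

def log_likelihood_weights(target_doc, background_docs):
    """Log-likelihood score for each concept of one document (target corpus)
    against the remaining documents of its topic subset (background corpus)."""
    target = Counter(target_doc)
    background = Counter(c for doc in background_docs for c in doc)
    c, d = sum(target.values()), sum(background.values())  # corpus sizes
    weights = {}
    for concept, a in target.items():
        b = background.get(concept, 0)
        # Expected frequencies under the null hypothesis of equal rates.
        e1 = c * (a + b) / (c + d)
        e2 = d * (a + b) / (c + d)
        weights[concept] = 2 * ((a * math.log(a / e1) if a else 0.0)
                                + (b * math.log(b / e2) if b else 0.0))
    return weights

def top_half(doc, weights):
    """Keep only the 50% of a document's distinct concepts with the
    highest log-likelihood weights (at least one concept is retained)."""
    ranked = sorted(set(doc), key=lambda c: weights[c], reverse=True)
    keep = set(ranked[:max(1, len(ranked) // 2)])
    return [c for c in doc if c in keep]
```

Concepts over-represented in the target document relative to the background receive high weights, so the truncated representation emphasises what sets the document apart within its broad topic.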

Log-likelihood: [epithelial cells, ii, porcine, intestinal, sting]; [replication, h, transcription, hsv-1, phosphorylation]; [rnai, dsrna, gene expression, accumulation, polypeptide]; [n protein, cholesterol, exosomes, ebv, nucleolus]; [antigen, antibodies, t cells, integrin, cd8]

Table 5: Top 5 topics based on counts vs. log-likelihood weights.

Conclusion

Subdividing the dataset based on the most prevalent topic in each document and then using the resulting subsets for topic modelling helps to learn highly specific topics, which highlight general aspects of the subject matter if a frequency-based representation is used, or more narrow questions if only the concepts with the highest log-likelihood weights are included in the model. Unfortunately, the scope of this paper did not allow us to compare the results with those of neural topic models with word-based and concept-based embeddings; we leave this question for further research.

Figure 1: Coherence scores for different representations of the CORD-19 corpus.