Measuring Fine-Grained Domain Relevance of Terms: A Hierarchical Core-Fringe Approach

We propose to measure fine-grained domain relevance: the degree to which a term is relevant to a broad (e.g., computer science) or narrow (e.g., deep learning) domain. Such measurement is crucial for many downstream tasks in natural language processing. To handle long-tail terms, we build a core-anchored semantic graph, which uses core terms with rich description information to bridge the vast remaining fringe terms semantically. To support a fine-grained domain without relying on a matching corpus for supervision, we develop hierarchical core-fringe learning, which learns core and fringe terms jointly in a semi-supervised manner contextualized in the hierarchy of the domain. To reduce expensive human efforts, we employ automatic annotation and hierarchical positive-unlabeled learning. Our approach applies to big or small domains, covers head or tail terms, and requires little human effort. Extensive experiments demonstrate that our methods outperform strong baselines and even surpass professional human performance.


Introduction
With countless terms in human languages, no one can know all terms, especially those belonging to a technical domain. Even for domain experts, it is quite challenging to identify all terms in the domains they specialize in. However, recognizing and understanding domain-relevant terms is the basis for mastering domain knowledge, and having a sense of which domains a term is relevant to is an initial and crucial step toward term understanding.
In this paper, as our problem, we propose to measure fine-grained domain relevance, defined as the degree to which a term is relevant to a given domain, where the given domain can be broad or narrow; this is an important property of terms that has not been carefully studied before. E.g., deep learning is a term relevant to the domains of computer science and, more specifically, machine learning, but not so much to others like database or compiler. Thus, it has a high domain relevance for the former domains but a low one for the latter. From another perspective, we propose to decouple extraction and evaluation in automatic term extraction, which aims to extract domain-specific terms from texts (Amjadian et al., 2018; Hätty et al., 2020). This decoupling setting is novel and useful because it is not limited to broad domains where a domain-specific corpus is available, and it does not require that terms appear in the corpus.
A good command of the domain relevance of terms facilitates many downstream applications. E.g., to build a domain taxonomy or ontology, a crucial step is to acquire relevant terms (Al-Aswadi et al., 2019; Shang et al., 2020). Also, it can provide or filter candidate terms for domain-focused natural language tasks (Huang et al., 2020). In addition, for text classification and recommendation, the domain relevance of a document can be measured by that of its terms.
We aim to measure fine-grained domain relevance as a semantic property of any term in human languages. Therefore, to be practical, the proposed model for domain relevance measuring must meet the following requirements: 1) covering almost all terms in human languages; 2) applying to a wide range of broad and narrow domains; and 3) relying on little or no human annotation.
However, among countless terms, only some are popular ones organized and associated with rich information on the Web, e.g., Wikipedia pages, which we can leverage to characterize the domain relevance of such "head terms." In contrast, there are numerous "long-tail terms" (those not as frequently used) which lack descriptive information. As Challenge 1, how do we measure the domain relevance of such long-tail terms?
On the other hand, among possible domains of interest, only the broad ones (e.g., physics, computer science) naturally have domain-specific corpora. Many existing works (Velardi et al., 2001; Amjadian et al., 2018; Hätty et al., 2020) rely on such domain-specific corpora to identify domain-specific terms by contrasting their distributions with general ones. In contrast, fine-grained domains (e.g., quantum mechanics, deep learning), which can be any topics of interest, do not usually have a matching corpus. As Challenge 2, how do we achieve good performance for a fine-grained domain without assuming a domain-specific corpus?
Finally, automatic learning usually requires large amounts of training data. Since there are countless terms and plentiful domains, human annotation is very time-consuming and laborious. As Challenge 3, how do we reduce expensive human effort when applying machine learning methods to our problem?
As our solutions, we propose a hierarchical core-fringe domain relevance learning approach that addresses these challenges. First, to deal with long-tail terms, we design the core-anchored semantic graph, which includes core terms with rich description information and fringe terms without it. Based on this graph, we can bridge domain relevance through term relevance and include any term in evaluation. Second, to leverage the graph and support fine-grained domains without relying on domain-specific corpora, we propose hierarchical core-fringe learning, which learns the domain relevance of core and fringe terms jointly in a semi-supervised manner contextualized in the hierarchy of the domain. Third, to reduce human effort, we employ automatic annotation and hierarchical positive-unlabeled learning, which allow our model to be trained with little or even no human effort.
Overall, our framework consists of two processes: 1) the offline construction process, where a domain relevance measuring model is trained by taking a large set of seed terms and their features as input; 2) the online query process, where the trained model can return the domain relevance of query terms by including them in the core-anchored semantic graph. Our approach applies to a wide range of domains and can handle any query, while nearly no human effort is required. To validate the effectiveness of our proposed methods, we conduct extensive experiments on various domains with different settings. Results show our methods significantly outperform well-designed baselines and even surpass human performance by professionals.

Related Work
The problem of domain relevance of terms is related to automatic term extraction, which aims to extract domain-specific terms from texts automatically. Compared to our task, automatic term extraction, where extraction and evaluation are combined, has narrower applicability and a relatively heavy dependence on corpora and human annotation, so it is limited to several broad domains and may cover only a small number of terms. Existing approaches for automatic term extraction can be roughly divided into three categories: linguistic, statistical, and machine learning methods. Linguistic methods apply human-designed rules to identify technical/legal terms in a target corpus (Handler et al., 2016; Ha and Hyland, 2017). Statistical methods use statistical information, e.g., the frequency of terms, to identify terms from a corpus (Frantzi et al., 2000; Nakagawa and Mori, 2002; Velardi et al., 2001; Drouin, 2003; Meijer et al., 2014). Machine learning methods learn a classifier, e.g., a logistic regression classifier, with manually labeled data (Conrado et al., 2013; Fedorenko et al., 2014; Hätty et al., 2017). There also exists some work on automatic term extraction with Wikipedia (Vivaldi et al., 2012; Wu et al., 2012). However, the terms studied there are restricted to those associated with a Wikipedia page.
Recently, inspired by distributed representations of words (Mikolov et al., 2013a), methods based on deep learning have been proposed and achieve state-of-the-art performance. Amjadian et al. (2016, 2018) design supervised learning methods by taking the concatenation of domain-specific and general word embeddings as input. Hätty et al. (2020) propose a multi-channel neural network model that leverages domain-specific and general word embeddings.
The techniques behind our hierarchical core-fringe learning methods are related to research on graph neural networks (GNNs) (Kipf and Welling, 2017; Hamilton et al., 2017); hierarchical text classification (Vens et al., 2008; Wehrmann et al., 2018; Zhou et al., 2020); and positive-unlabeled learning (Liu et al., 2003; Elkan and Noto, 2008; Bekker and Davis, 2020).

Figure 1: The overview of the framework. In this figure, machine learning is a core term associated with a Wikipedia page, few-shot learning is a fringe term included in the offline core-anchored semantic graph, and quantum chemistry is a fringe term included in the online process. Best viewed in color.

Methodology
We study the fine-grained domain relevance of terms, which is defined as follows:

Definition 1 (Fine-Grained Domain Relevance). The fine-grained domain relevance of a term is the degree to which the term is relevant to a given domain, where the given domain can be broad or narrow.
The domain relevance of terms depends on many factors. In general, a term with higher semantic relevance, broader meaning scope, and wider usage possesses a higher domain relevance regarding the target domain. To measure the fine-grained domain relevance of terms, we propose a hierarchical core-fringe approach, which includes an offline training process and can handle any query term in evaluation. The overview of the framework is illustrated in Figure 1.

Core-Anchored Semantic Graph
There exist countless terms in human languages; thus it is impractical to include all terms in a system initially. To build the offline system, we need to provide seed terms, which can come from knowledge bases or be extracted from broad, large corpora by existing term/phrase extraction methods (Handler et al., 2016;Shang et al., 2018).
In addition to providing seed terms, we should also give some knowledge to machines so that they can differentiate whether a term is domain-relevant or not. To this end, we can leverage the description information of terms. For instance, Wikipedia contains a large number of terms (the surface form of page titles), where each term is associated with a Wikipedia article page. With this page information, humans can easily judge whether a term is domain-relevant or not. In Section 3.3, we will show the labeling can even be done completely automatically.
However, considering the countless terms, the number of terms that are well organized and associated with rich descriptions is small. Measuring the fine-grained domain relevance of terms without rich information is quite challenging for both machines and humans.
Fortunately, terms are not isolated; complex relations exist between them. If a term is relevant to a domain, it must also be relevant to some domain-relevant terms, and vice versa. That is to say, we can bridge the domain relevance of terms through term relevance. Summarizing these observations, we divide terms into two categories: core terms, which are associated with rich description information, e.g., Wikipedia article pages, and fringe terms, which are terms without that information. We assume that, for each term, there exist some relevant core terms that share similar domains. If we can find the most relevant core terms for a given term, its domain relevance can be evaluated with the help of those terms. To this end, we can utilize the rich information of core terms for ranking.
Taking Wikipedia as an example, each core term is associated with an article page, so core terms can be returned as the ranking results (result terms) for a given term (query term). Considering the data resources, we use the built-in Elasticsearch-based Wikipedia search engine (Gormley and Tong, 2015). More specifically, we set the maximum number of links as k (5 as default). For a query term v, i.e., any seed term, we first retrieve the top 2k Wikipedia pages with exact match. For each result term u in the core, we create a link from u to v. If the number of links is smaller than k, we repeat this process without exact match and build additional links. Finally, we construct a term graph, named the Core-Anchored Semantic Graph, where nodes are terms and edges are the links between them.
In addition, for terms that are not provided initially, we can also handle them as fringe terms and connect them to core terms in evaluation. In this way, we can include any term in the graph.
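As a concrete (simplified) illustration, the linking step above can be sketched as follows. The `search` function is a hypothetical stand-in for the Elasticsearch-backed Wikipedia ranker and is assumed to return core terms in relevance order; it is not part of the described system.

```python
def build_links(query_term, search, k=5):
    """Link a query term v to up to k core terms (Section 3.1 sketch).

    `search(term, exact=...)` is a hypothetical ranker returning core
    terms ordered by relevance; edges go from result term u to v.
    """
    # First pass: exact-match retrieval over the top-2k candidates.
    results = search(query_term, exact=True)[:2 * k]
    links = [(u, query_term) for u in results[:k]]
    if len(links) < k:
        # Fallback pass: relax exact match and add links until k is reached.
        seen = {u for u, _ in links}
        for u in search(query_term, exact=False):
            if u not in seen:
                links.append((u, query_term))
                seen.add(u)
            if len(links) == k:
                break
    return links
```

The same routine covers the online case: a previously unseen query term is connected to core terms on the fly and thereby included in the graph.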

Hierarchical Core-Fringe Learning
In this section, we design learning methods that learn the fine-grained domain relevance of core and fringe terms jointly. In addition to using the term graph, we can obtain features of both core and fringe terms based on their linguistic and statistical properties (Terryn et al., 2019; Conrado et al., 2013) or distributed representations (Mikolov et al., 2013b; Yu and Dredze, 2015). We assume the labels, i.e., domain-relevant or not, of core terms are available, which can be obtained by the automatic annotation mechanism introduced in Section 3.3.
As stated above, if a term is highly relevant to a given domain, it must also be highly relevant to some other terms with a high domain relevance, and vice versa. Therefore, to measure the domain relevance of a term, in addition to using its own features, we aggregate its neighbors' features. Specifically, we propagate the features of terms via the term graph and use the label information of core terms for supervision. In this way, core and fringe terms help each other, and the domain relevance is learned jointly. The propagation process can be achieved by graph convolutions (Hammond et al., 2011). We first apply the vanilla graph convolutional networks (GCNs) (Kipf and Welling, 2017) in our framework. The graph convolution operation (GCNConv) at the l-th layer is formulated as the following aggregation and update process:

$$h_i^{(l+1)} = \phi\Big( \sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)} + b^{(l)} \Big), \qquad (1)$$

where $h_j^{(l)} \in \mathbb{R}^{d^{(l)}}$ is the hidden state of node $j$ at the $l$-th layer, with $d^{(l)}$ being the number of units; $\mathcal{N}(i)$ is the neighborhood of node $i$; $c_{ij} = \sqrt{\hat{d}_i \hat{d}_j}$ is the symmetric normalization constant, with $\hat{d}_i$ the degree of node $i$ including self-loops; $W^{(l)}$ is the trainable weight matrix and $b^{(l)}$ is the bias vector. $\phi(\cdot)$ is the nonlinear activation function, e.g., $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$.
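The aggregation-and-update step of a single GCN layer (Kipf and Welling, 2017) can be sketched in plain Python as below; this is a didactic illustration with dense adjacency lists, not the paper's implementation.

```python
import math

def gcn_layer(A, H, W, b, act=lambda x: max(0.0, x)):
    """One GCNConv layer: symmetric normalization with self-loops,
    neighborhood aggregation, linear projection, then nonlinearity."""
    n = len(A)
    # Add self-loops so each node also aggregates its own features.
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in A_hat]
    # Symmetric normalization 1/sqrt(d_i * d_j).
    norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) if A_hat[i][j] else 0.0
             for j in range(n)] for i in range(n)]
    d_in, d_out = len(W), len(W[0])
    out = []
    for i in range(n):
        # Aggregate neighbor (and own) features.
        agg = [sum(norm[i][j] * H[j][f] for j in range(n)) for f in range(d_in)]
        # Project with W, add bias, apply the activation phi.
        out.append([act(sum(agg[f] * W[f][o] for f in range(d_in)) + b[o])
                    for o in range(d_out)])
    return out
```

Stacking two such layers, as done for CFL, lets a fringe term's representation absorb information from core terms two hops away.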
Since core terms are labeled as domain-relevant or not, we can use the labels to calculate the loss:

$$\mathcal{L} = \ell(z, y) = -\sum_{i \in \mathcal{V}_c} \big( y_i \log z_i + (1 - y_i) \log (1 - z_i) \big), \qquad (2)$$

where $\mathcal{V}_c$ denotes the set of core terms, $y_i$ is the label of node $i$ regarding the target domain, and $z_i = \sigma(h_i^o)$, with $h_i^o$ being the output of the last GCNConv layer for node $i$ and $\sigma(\cdot)$ being the sigmoid function. The weights of the model are trained by minimizing the loss. The relative domain relevance is obtained as $s = z$.
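A minimal sketch of this supervision scheme, assuming `z` holds sigmoid outputs for all nodes and `core_idx` indexes the labeled core terms (fringe terms contribute no loss):

```python
import math

def core_loss(z, y, core_idx):
    """Mean binary cross-entropy over labeled core terms only."""
    losses = [-(y[i] * math.log(z[i]) + (1 - y[i]) * math.log(1 - z[i]))
              for i in core_idx]
    return sum(losses) / len(losses)
```

Because fringe terms are excluded from the loss but participate in the graph convolution, they are learned in a semi-supervised manner.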
Combining with the overall framework, we get the first domain relevance measuring model, CFL, i.e., Core-Fringe Domain Relevance Learning.
CFL is useful for measuring the domain relevance for broad domains such as computer science. For domains with relatively narrow scopes, e.g., machine learning, we can also leverage the label information of domains at higher levels of the hierarchy, e.g., CS → AI → ML, based on the idea that a term relevant to the target domain should also be relevant to the parent domain. Inspired by related work on hierarchical multi-label classification (Vens et al., 2008; Wehrmann et al., 2018), we introduce a hierarchical learning method considering both global and local information.
We first apply $l_c$ GCNConv layers according to Eq. (1) and obtain the output of the last GCNConv layer, $h^{(l_c)}_i$. To avoid clutter, we omit the subscript identifying the node. For each domain in the hierarchy, we introduce a hierarchical global activation $a_p^{(l)}$. The activation at the (l+1)-th level of the hierarchy is given as

$$a_p^{(l+1)} = \phi\big(W_p^{(l)} [a_p^{(l)}; h^{(l_c)}] + b_p^{(l)}\big), \qquad (3)$$

where $[\cdot;\cdot]$ indicates the concatenation of two vectors and $a_p^{(1)} = \phi(W_p^{(0)} h^{(l_c)} + b_p^{(0)})$. The global information is produced after a fully connected layer:

$$z_G = \sigma\big(W_G\, a_p^{(l_p)} + b_G\big), \qquad (4)$$

where $l_p$ is the total number of hierarchical levels. To achieve the local information for each level of the hierarchy, the model first generates the local hidden state $a_q^{(l)}$ by a fully connected layer:

$$a_q^{(l)} = \phi\big(W_q^{(l)} a_p^{(l)} + b_q^{(l)}\big). \qquad (5)$$

The local information at the l-th level of the hierarchy is then produced as

$$z_L^{(l)} = \sigma\big(W_L^{(l)} a_q^{(l)} + b_L^{(l)}\big). \qquad (6)$$

In our core-fringe framework, all the core terms are labeled at each level of the hierarchy. Therefore, the loss of hierarchical learning is computed as

$$\mathcal{L} = \ell\big(z_G, y^{(l_p)}\big) + \sum_{l=1}^{l_p} \ell\big(z_L^{(l)}, y^{(l)}\big), \qquad (7)$$

where $y^{(l)}$ denotes the labels regarding the domain at the l-th level of the hierarchy and $\ell(z, y)$ is the binary cross-entropy loss described in Eq. (2). In testing, the relative domain relevance $s$ is calculated as

$$s = \alpha\, z_G + (1 - \alpha)\, \big(z_L^{(1)} \bullet \cdots \bullet z_L^{(l_p)}\big), \qquad (8)$$

where $\bullet$ denotes element-wise multiplication and $\alpha$ is a hyperparameter to balance the global and local information (0.5 as default). Combining with our general framework, we refer to this model as HiCFL, i.e., Hierarchical CFL.
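As an illustration of how global and local information can be blended at test time, the sketch below assumes the per-level local outputs are combined by element-wise multiplication before being mixed with the global output via α; this is one plausible reading of the combination rule, in the spirit of HMCN-style hierarchical classifiers (Wehrmann et al., 2018), not a verbatim reproduction of the paper's formula.

```python
def combine(z_global, z_locals, alpha=0.5):
    """Blend the global prediction with the element-wise product of the
    per-level local predictions; alpha balances the two (0.5 default)."""
    prod = [1.0] * len(z_global)
    for z in z_locals:
        prod = [p * v for p, v in zip(prod, z)]
    return [alpha * g + (1 - alpha) * p for g, p in zip(z_global, prod)]
```

Multiplying local outputs enforces the hierarchy: a term scores high only if every ancestor domain also considers it relevant.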
Online Query Process. If seed terms are extracted from broad, large corpora relevant to the target domain, most terms of interest will already be included in the offline process. In evaluation, our model treats terms that were not provided initially as fringe terms. Specifically, when receiving such a term, the model connects it to core terms by the method described in Section 3.1. With its features (e.g., compositional term embeddings) or, when features cannot be generated directly, only its neighbors' features, the trained model can return the domain relevance of any query.

Automatic Annotation and Hierarchical Positive-Unlabeled Learning
Automatic Annotation. For the fine-grained domain relevance problem, human annotation is very time-consuming and laborious because the number of core terms is very large across a wide range of domains. Fortunately, in addition to building the term graph, we can also leverage the rich information of core terms for automatic annotation.
In the core-anchored semantic graph constructed with Wikipedia, each core term is associated with a Wikipedia page, and each page is assigned one or more categories. All the categories form a hierarchy, which further provides a category tree. For a given domain, we can first traverse from a root category and collect gold subcategories. For instance, for computer science, we treat category: subfields of computer science as the root category and take the categories at its first three levels as gold subcategories. Then we collect the categories of each core term and examine whether the term itself or one of its categories is a gold subcategory. If so, we label the term as positive; otherwise, we label it as negative. We can also combine gold subcategories from existing domain taxonomies and extract the categories of core terms from the text description, which usually contains useful text patterns like "x is a subfield of y".
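The annotation procedure can be sketched as a breadth-first traversal of the category tree followed by a membership test. Here `children` is a hypothetical mapping from a category to its subcategories; the actual category data comes from Wikipedia.

```python
from collections import deque

def gold_subcategories(root, children, depth=3):
    """Collect all categories within `depth` levels below `root` (BFS)."""
    gold, frontier = {root}, deque([(root, 0)])
    while frontier:
        cat, d = frontier.popleft()
        if d == depth:
            continue  # do not expand beyond the chosen depth
        for sub in children.get(cat, []):
            if sub not in gold:
                gold.add(sub)
                frontier.append((sub, d + 1))
    return gold

def label_term(term, term_categories, gold):
    """Positive if the term itself or any of its categories is gold."""
    return 1 if term in gold or any(c in gold for c in term_categories) else 0
```

The same traversal with `depth=2` would mirror the narrower-domain setting described in the experiments.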
Hierarchical Positive-Unlabeled Learning. With the above methods, we can learn the fine-grained domain relevance of terms for any domain as long as we can collect enough gold subcategories for it. However, for domains at low levels of the hierarchy, e.g., deep learning, a category tree might not be available in Wikipedia. To deal with this issue, we apply our learning methods in a positive-unlabeled (PU) setting (Bekker and Davis, 2020), where only a small number of terms, e.g., 10, are labeled as positive and all the other terms are unlabeled. We use this setting based on the following consideration: if a user is interested in a specific domain, it is quite easy for her to give some important terms relevant to that domain.
Benefiting from our hierarchical core-fringe learning approach, we can still obtain labels for domains at higher levels of the hierarchy with the automatic annotation mechanism. Therefore, all the negative examples at the last labeled level of the hierarchy can be used as reliable negatives for the target domain. For instance, if the target domain is deep learning, which lies in the CS → AI → ML → DL hierarchy, we consider all non-ML terms as reliable negatives for DL. Taking the positively labeled examples and the reliable negatives for supervision, we can learn the domain relevance of terms with our proposed HiCFL model, contextualized in the hierarchy of the domain.
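A minimal sketch of this labeling scheme, with hypothetical term sets: terms outside the parent domain (e.g., non-ML terms) become reliable negatives for the target domain (e.g., DL), the user-given seeds are positives, and everything else stays unlabeled.

```python
def pu_labels(terms, parent_positive, target_positive_seeds):
    """Hierarchical PU labeling: 1 = seed positive, 0 = reliable negative
    (negative at the parent level), None = unlabeled."""
    labels = {}
    for t in terms:
        if t in target_positive_seeds:
            labels[t] = 1
        elif t not in parent_positive:
            labels[t] = 0  # e.g., a non-ML term cannot be a DL term
        else:
            labels[t] = None
    return labels
```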

Experiments
In this section, we evaluate our model from different perspectives. 1) We compare with baselines by treating some labeled terms as queries. 2) We compare with human professionals by letting humans and machines judge which term in a query pair is more relevant to a target domain. 3) We conduct intuitive case studies by ranking terms according to their domain relevance.

Experimental Setup
Datasets and Preprocessing. To build the system, for offline processing, we extract seed terms from the arXiv dataset (version 6). As an example, for computer science or its sub-domains, we collect the abstracts in computer science according to the arXiv Category Taxonomy, and apply phrasemachine (Handler et al., 2016) to extract terms, with lemmatization and several filtering rules: frequency > 10; length ≤ 6; containing only letters, numbers, and hyphens; not a stopword or a single letter.
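The filtering rules above can be sketched as a simple predicate (assuming lemmatized, lowercased candidate terms; the stopword set here is a tiny stand-in for a full list):

```python
import re

STOPWORDS = {"the", "of", "a", "an", "and"}  # stand-in for a full list

def keep_term(term, freq):
    """Apply the seed-term filtering rules to one candidate."""
    tokens = term.split()
    return (freq > 10                      # frequency > 10
            and len(tokens) <= 6           # at most 6 tokens
            # only lowercase letters, digits, hyphens, and spaces
            and re.fullmatch(r"[a-z0-9\- ]+", term) is not None
            # drop single-token stopwords and single letters
            and not (len(tokens) == 1
                     and (term in STOPWORDS or len(term) == 1)))
```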
We select three broad domains, including computer science (CS), physics (Phy), and mathematics (Math); and three narrow sub-domains of them, including machine learning (ML), quantum mechanics (QM), and abstract algebra (AA), with the hierarchies CS → AI → ML, Phy → mechanics → QM, and Math → algebra → AA. Each broad domain and its sub-domains share seed terms because they share a corpus. To achieve gold subcategories for automatic annotation (Section 3.3), we collect subcategories at the first three levels of a root category (e.g., category: subfields of physics) for broad domains (e.g., physics), or at the first two levels for narrow domains, e.g., category: machine learning for machine learning. Table 1 reports the total sizes and the proportions of core terms.
Baselines. Since our task on fine-grained domain relevance is new, there is no existing baseline for model comparison. We adapt the following models from relevant tasks to our setting with additional inputs (e.g., domain-specific corpora):
• Relative Domain Frequency (RDF): Since domain-relevant terms usually occur more in a domain-specific corpus, we apply a statistical method using freq_s(w)/freq_g(w) to measure the domain relevance of term w, where freq_s(·) and freq_g(·) denote the frequency of occurrence in the domain-specific and general corpora, respectively.
• Logistic Regression (LR): Logistic regression is a standard supervised learning method. We use core terms with labels (domain-relevant or not) as training data, where features are term embeddings trained on a general corpus.
• Multilayer Perceptron (MLP): MLP is a standard neural network-based model. We train MLP using embeddings trained on a domain-specific corpus or a general corpus as term features, respectively. We also concatenate the two embeddings as features (Amjadian et al., 2016, 2018).
• Multi-Channel (MC): Multi-Channel (Hätty et al., 2020) is the state-of-the-art model for automatic term extraction, based on a multi-channel neural network that takes domain-specific and general corpora as input.
Training. For all supervised learning methods, we apply the automatic annotation of Section 3.3, i.e., we automatically label all the core terms for model training. In the PU setting, we remove labels on the target domains. Only 20 (10 in the case studies) domain-relevant core terms are randomly selected as positives, with the remaining terms unlabeled. In training, all the negative examples at the previous level of the hierarchy are used as reliable negatives.
Implementation Details. Though our proposed methods are independent of corpora, some baselines (e.g., MC) require term embeddings trained on general/domain-specific corpora. For an easy and fair comparison, we adopt the following approach to generate term features. We consider each term as a single token and apply word2vec CBOW (Mikolov et al., 2013a) with negative sampling, where the dimensionality is 100, the window size is 5, and the number of negative samples is 5. The training corpus can be a general one (the entire arXiv corpus, denoted as G) or a domain-specific one (the subcorpus in the branch of the corresponding domain, denoted as S). We also apply compositional GloVe embeddings (Pennington et al., 2014) (element-wise addition of the pre-trained 100d word embeddings, denoted as C) as non-corpus-specific features of terms for reference.
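The compositional features (C) can be sketched as an element-wise sum of word vectors, which also covers multi-word terms missing from the embedding vocabulary. Here `word_vecs` is a hypothetical word-to-vector mapping standing in for pre-trained GloVe embeddings.

```python
def term_embedding(term, word_vecs, dim=100):
    """Compositional term feature: element-wise sum of the embeddings of
    the term's words; out-of-vocabulary words are skipped."""
    vec = [0.0] * dim
    for w in term.split():
        wv = word_vecs.get(w)
        if wv is not None:
            vec = [a + b for a, b in zip(vec, wv)]
    return vec
```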
For all the neural network-based models, we use Adam (Kingma and Ba, 2015) with learning rate of 0.01 for optimization, and adopt a fixed hidden dimensionality of 256 and a fixed dropout ratio of 0.5. For the learning part of CFL and HiCFL, we apply two GCNConv layers and use the symmetric graph for training. To avoid overfitting, we adopt batch normalization (Ioffe and Szegedy, 2015) right after each layer (except for the output layer) and before activation and apply dropout (Hinton et al., 2012) after the activation. We also try to add regularizations for MLP and MC with full-batch or mini-batch training, and select the best architecture. To construct the core-anchored semantic graph, we set k as 5. All experiments are run on an NVIDIA Quadro RTX 5000 with 16GB of memory under the PyTorch framework. The training of CFL for the CS domain can finish in 1 minute.
We report the mean and standard deviation of the test results corresponding to the best validation results with 5 different random seeds.

Comparison to Baselines
To compare with baselines, we separate a portion of core terms as queries for evaluation. Specifically, for each domain, we use 80% of the labeled terms for training, 10% for validation, and 10% for testing (with automatic annotation). Terms in the validation and testing sets are treated as fringe terms. By doing this, the evaluation can, to some extent, represent the general performance on all fringe terms. The model comparison is also fair, since the rich information of the terms used for evaluation is not used in training. We additionally create a test set with careful human annotation on machine learning to support our overall evaluation, which contains 2000 terms, with half for validation and half for testing.
As evaluation metrics, we calculate both ROC-AUC and PR-AUC with automatic or manually created labels. ROC-AUC is the area under the receiver operating characteristic curve, and PR-AUC is the area under the precision-recall curve. If a model achieves higher values, most of the domain-relevant terms are ranked higher, which means the model provides a better measurement of the domain relevance of terms. Table 2 and Table 3 show the results for the three broad and the three narrow domains, respectively. We observe that our proposed CFL and HiCFL outperform all the baselines, and the standard deviations are low. Compared to MLP, CFL achieves much better performance, benefiting from the core-anchored semantic graph and feature aggregation, which demonstrates that domain relevance can be bridged via term relevance. Compared to CFL, HiCFL works better owing to hierarchical learning.
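ROC-AUC has an equivalent rank-based formulation (the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counting one half), which can be sketched as:

```python
def roc_auc(scores, labels):
    """Rank-based ROC-AUC over binary labels (1 = positive, 0 = negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive-negative pairs ranked correctly; ties score 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This matches what library implementations such as scikit-learn's `roc_auc_score` compute for binary labels.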
In the PU setting, i.e., when automatic annotation is not applied to the target domain, HiCFL still achieves satisfactory performance and significantly outperforms all the baselines, even though only 20 positives are given (Table 4).
The PR-AUC scores on the manually created test set without and with the PU setting are reported in Table 5. We observe that the results are generally consistent with those reported in Table 3 and Table 4, which indicates that the evaluation with core terms can work just as well.

Comparison to Human Performance
In this section, we aim to compare our model with human professionals in measuring the fine-grained domain relevance of terms. Because it is difficult for humans to assign a score representing domain relevance directly, we generate term pairs as queries and let humans judge which one in a pair is more relevant to machine learning. Specifically, we create 100 ML-AI, ML-CS, and AI-CS pairs, respectively. Taking ML-AI as an example, each query pair consists of an ML term and an AI term, and the judgment is considered right if the ML term is selected.

Table 6: Accuracy on pairwise comparisons.
        ML-AI        ML-CS        AI-CS
Human   0.698±0.087  0.846±0.074  0.716±0.115
HiCFL   0.854±0.017  0.932±0.007  0.768±0.023
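The pairwise evaluation reduces to checking whether the model scores the intended term higher; a minimal sketch with hypothetical term scores:

```python
def pair_accuracy(pairs, score):
    """Fraction of (more_relevant, less_relevant) pairs where the model
    assigns the first term a strictly higher score."""
    correct = sum(score[a] > score[b] for a, b in pairs)
    return correct / len(pairs)
```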
The human annotation is conducted by five senior students majoring in computer science and doing research related to terminology. Because there is no clear boundary between ML, AI, and CS, it is possible that a CS term is more relevant to machine learning than an AI term. However, the overall trend is that the higher the accuracy, the better the performance. From Table 6, we observe that HiCFL far outperforms human performance. Although we have reduced the difficulty, the task is still very challenging for human professionals.

(Caption for Tables 7 and 8: The depth of the background color indicates the domain relevance; the darker the color, the higher the domain relevance (annotated by the authors). * indicates a core term; otherwise, the term is a fringe term.)

Case Studies
We interpret our results by ranking terms according to their domain relevance regarding machine learning or deep learning, with the hierarchy CS → AI → ML → DL. For CS-ML, we label terms with automatic annotation. For DL, we manually create 10 DL terms as the positives for PU learning. Table 7 and Table 8 show the ranking results (1-10 represents terms ranked 1st to 10th). We observe that the performance is satisfactory. For ML, important concepts such as supervised learning, unsupervised learning, and deep learning are ranked very high, and terms ranked up to 1010th are all good domain-relevant terms. For DL, although only 10 positives are provided, the ranking results are quite impressive. E.g., unlabeled positive terms like artificial neural network, generative adversarial network, and neural architecture search are ranked very high. Besides, terms ranked 101st to 110th are all highly relevant to DL, and terms ranked 1001st to 1010th are related to ML.

Conclusion
We introduce and study the fine-grained domain relevance of terms, an important property of terms that has not been carefully studied before. We propose a hierarchical core-fringe domain relevance learning approach, which can cover almost all terms in human languages and various domains, while requiring little or even no human annotation.
We believe this work will inspire automated solutions for knowledge management and help a wide range of downstream applications in natural language processing. It would also be interesting to integrate our methods into more challenging tasks, for example, characterizing more complex properties of terms or even understanding terms.