NanoNER: Named Entity Recognition for Nanobiology Using Experts’ Knowledge and Distant Supervision

Here we present the training and evaluation of NanoNER, a Named Entity Recognition (NER) model for nanobiology. NER consists in identifying specific entities in spans of unstructured text and is often a primary task in Natural Language Processing (NLP) and Information Extraction. The aim of our model is to recognise entities previously identified by domain experts as constituting the essential knowledge of the domain. Relying on ontologies, which provide us with a domain vocabulary and taxonomy, we implemented an iterative process enabling experts to determine the entities relevant to the domain at hand. We then explore the potential of distant supervision learning in NER, showing how this method can increase the quantity of annotated data with minimal additional manpower. On our full corpus of 728 full-text nanobiology articles, containing more than 120k entity occurrences, NanoNER obtained an F1-score of 0.98 on the recognition of previously known entities. Our model also demonstrated its ability to discover new entities in the text, with precision scores ranging from 0.77 to 0.81. Ablation experiments further confirmed this ability, with the model rediscovering up to 30% of the ablated terms, and highlighted the dependency of our approach on the external resource. This paper details the methodology employed, the experimental design, and the key findings, providing insights and directions for future research on NER in specialized domains. Furthermore, since our approach requires minimal manpower, we believe that it can be generalized to other specialized fields.


Introduction
As the volume of the scientific literature increases, the demand for NLP models able to deal with domain vocabulary and specific knowledge is becoming increasingly apparent. The NanoBubbles project, from which the work presented here originates, aims at studying how, when and why science fails to correct itself. It focuses on the nanobiology domain and combines approaches from the natural sciences, natural language processing and social sciences. The field of nanobiology, characterized by both its multidisciplinarity and its high degree of specialization, is a perfect example of the need for specialized tools. Thus, we must leverage methods from Natural Language Processing (NLP) to assist in the extraction of important information from a large number of articles. The main task of this paper is to train a Named Entity Recognition (NER) model in the field of nanobiology.
The primary task of Named Entity Recognition is to identify and classify specific entities (i.e. named entities) in a text. Compared to other fields, Biomedical NER (BMNER) is a particularly challenging problem, mainly due to the high cost of obtaining quality annotated data and the complexity of domain terminology. A famous example of a model able to perform BMNER is BioBERT (Lee et al., 2020), which is pre-trained on a large-scale corpus of biomedical text. It performs well on a standard set of biomedical benchmarks in several downstream tasks (e.g., NER, Relation Extraction, Q&A). To our knowledge, NER in the nanobiology domain remains uncharted territory, as existing BMNER models are not trained to recognize entities of interest in this specific field.
Training an efficient NER model requires a large amount of annotated data, which is not easy to come by in specialized domains, as the manual work it requires needs to be carried out by field experts. In our work on NER in the nanobiology field, we use distant supervision to compensate for the lack of annotated data and thus allow the creation of a corpus of articles from a specialized domain that is large enough to train a NER model. Using BioBERT (Lee et al., 2020) as the base model, this approach requires minimal human work. We believe that the approach we implemented, and describe here, can be adapted to other scientific domains.
First, we harnessed existing nanobiology ontologies (i.e., the NanoParticle Ontology (Thomas et al., 2011) and eNanoMapper (Hastings et al., 2015)) for their concept hierarchy and vocabulary. Then, an iterative process took place with a team of domain experts, who determined the essential labels for our NER model and curated the vocabulary. A round of vocabulary extension, with expert curation, took place before the automatic annotation of the corpus. Ablation experiments were also implemented to measure the influence of the vocabulary coverage in our distant supervision setting.
In summary, the main contributions of this paper are as follows:
1. We have implemented a method to create annotated data for NER. It consists in an iterative process, involving ontology and corpus analysis followed by the use of expert knowledge and expert validation. This led us to identify five labels, with vocabularies covering 1,438 terms, that are highly relevant to nanobiology.
2. We created NanoNER, a NER model for nanobiology using a distant supervision learning approach and trained on automatically annotated entities in a corpus of 728 unlabelled full-text nanobiology articles. Detailed ablation experiments were conducted to evaluate the influence of the vocabulary coverage.
3. Finally, ablation experiments allowed us to estimate the dependency of our model on the annotation resource. We can effectively measure how well NanoNER is able to generalize, i.e. its ability to (re)find entities not present in the training set, as well as the essential and minimal terms needed to obtain satisfactory results.

Related work
Existing BMNER solutions encompass early NER methods, such as dictionary matching or rule-based approaches, as well as supervised machine learning methods such as Markov models (Ponomareva et al., 2007a). Conditional Random Fields (CRFs) were then employed to perform BMNER (Ponomareva et al., 2007b; Friedrich et al., 2006). Unlike Markov models, CRFs can consider the characteristics of the entire input sequence, not just the current state. Support Vector Machines (SVMs) can also be used when NER is cast as a binary classification problem, such as determining whether a word is a named entity of a particular type (Ju et al., 2011).
Recently, deep learning approaches using large amounts of labeled data, such as models built on BioBERT (Lee et al., 2020), have achieved state-of-the-art results on BMNER. For instance, on the JNLPBA dataset (Huang et al., 2020), the KeBioLM model (Yuan et al., 2021) obtained an F1 score of 0.82 on recognizing entities relating to proteins, genes and cells. On the BC5CDR dataset (Li et al., 2016), the BINDER model (Zhang et al., 2022), which uses a contrastive learning approach, achieved an F1 score of 0.91 on chemical and disease entities. However, BMNER presents specific difficulties. For instance, Dong et al. (2016) conducted an extensive study on electronic medical records and found that such technical texts often contain a substantial amount of specialized terminology and knowledge, and frequently present issues such as spelling errors, abbreviations, and idiosyncratic terms, all of which add to the difficulty of the NER task. In this difficult setting, they proposed a method based on Convolutional Neural Networks (CNN) and Word2Vec for performing BMNER and managed to achieve an F1 score of 0.73.
To address the scarcity of annotated data for deep learning models, some weak supervision and distant supervision solutions have been proposed. Mintz et al. (2009) were among the pioneers of distant supervision learning, introducing this method in information and relation extraction tasks. Their goal was to extract relations between entities from a large amount of unlabeled text, using existing knowledge bases as distant supervision signals. Distant supervision was initially applied widely to relation extraction tasks and was later used extensively in NER tasks, where it has been validated in previous studies. Shang et al. (2018) revised the LSTM-CRF NER model of Lample et al. (2016) and utilized the MeSH database for chemical and disease entity research. Since the automatic annotation of a corpus tends to introduce noise in the training data, some methods have been proposed to reduce this effect (Meng et al., 2021), such as using early stopping or introducing the concept of pseudo-labels (Liang et al., 2020). Early stopping prevents over-fitting the model on the training data and fosters the learning of important features of the corpus. Pseudo-labeling expands the training set by generating new labeled data that can then be used alongside existing datasets.
BMNER using ontologies and distant supervision has already been performed in the biomedical domain (Fries et al., 2017; Wang et al., 2021), and this type of approach could be generalized to any domain for which a semantic and lexical resource exists. These works used different techniques to minimize the risk of noise propagation, e.g. filtering candidate annotations through heuristics based on part-of-speech analysis (Fries et al., 2017) or disambiguating ambiguous entities based on other entities present in the same context (Wang et al., 2021). In our work, we rely on domain experts at crucial steps: (1) determining the labels and then (2) filtering and validating the vocabulary of our annotation resource.
To the best of our knowledge, no one has yet proposed a BMNER model that meets the information mining needs of the nanobiology field. We thus aim at training a NER model that requires minimal manpower but still meets experts' requirements regarding the entities of interest of the domain.

Data preparation
Here we describe the essential resources for our work: the corpus and ontologies used, the expert work on selecting the labels relevant to the domain at hand as well as the associated vocabulary, and the automatic annotation of the scientific articles. All the code necessary to replicate this study is available online.

Corpus
The corpus used in this study comprises 728 research articles focused on the field of nanobiology. The vast majority of these articles are written in English. In total, the corpus contains 158,283 sentences and 3,762,791 tokens. On average, each paper in the corpus consists of 217 sentences, and each sentence contains approximately 24 tokens. This extensive dataset provides a rich foundation for in-depth analysis and research in the field of nanobiology. The articles were first obtained in PDF format, and the abstract and full text of each article were extracted using Grobid (Lopez, 2008-2023). Parts of the documents that are not considered as the core of the articles were excluded (e.g. References, Acknowledgment, Appendix).
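The averages quoted above follow directly from the corpus totals. The short check below recomputes them from the three reported figures; the variable names are ours and purely illustrative.

```python
# Totals reported for the corpus.
n_articles = 728
n_sentences = 158_283
n_tokens = 3_762_791

# Average number of sentences per article and tokens per sentence.
sentences_per_article = n_sentences / n_articles
tokens_per_sentence = n_tokens / n_sentences

print(f"sentences per article: {sentences_per_article:.1f}")
print(f"tokens per sentence:   {tokens_per_sentence:.1f}")
```

Rounded to the nearest integer, both values agree with the averages given in the text.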

Ontology
As resources, we used the NanoParticle Ontology for cancer nanotechnology research (NPO) (Thomas et al., 2011) and eNanoMapper (ENM) (Hastings et al., 2015), which are the two main ontologies in the field of nanobiology. As described in the ENM official documentation, ENM is an automatic extension of NPO and reuses several other ontologies including NPO, CHEMINF (Hastings et al., 2011), CHEBI (Degtyarenko et al., 2007) and ENVO (Buttigieg et al., 2013). The NPO possesses 1904 classes and 81 properties, while ENM contains over 25k classes, 697 individuals and 55 properties (August 2023). Since ENM is built automatically, we used it as a secondary source to NPO, in order to minimize the risk of noise propagation. The ontologies were used in CSV format, where each concept in the ontology has a unique key, a definition, synonyms, and a parent key. These resources are used for their subsumption relations and vocabularies, providing us with a taxonomy and a lexical database.

Labels and vocabulary
To determine the labels our model will be trained to recognize, and their vocabulary, we used an iterative process of reducing the ontologies, expanding the obtained vocabulary and having every step validated by domain experts. Because of the large number of concepts in the NPO and ENM ontologies, the difficulty of finding a focus to start with, and the fact that our aim is to create an automatically labeled corpus, we first retained only the concepts that had at least one occurrence in our corpus (i.e. ≈ 30% of NPO's and ≈ 10% of ENM's). Concepts that never appeared in the corpus were discarded, and the subsumption relations within the ontology were reconstructed to obtain a reduced ontology.
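The reduction step can be illustrated with a minimal sketch. It assumes the ontology is available as a mapping from each concept key to its parent key (as in the CSV format described above) and that reconstructing the subsumption relations means re-attaching each retained concept to its nearest retained ancestor. This is one plausible reading, not necessarily the exact procedure we followed, and `reduce_ontology` and the toy concepts are hypothetical.

```python
from typing import Dict, Optional, Set


def reduce_ontology(
    parents: Dict[str, Optional[str]], observed: Set[str]
) -> Dict[str, Optional[str]]:
    """Keep only observed concepts and re-attach each one to its
    nearest observed ancestor (None if no ancestor was observed)."""
    reduced: Dict[str, Optional[str]] = {}
    for concept in observed:
        ancestor = parents.get(concept)
        # Skip over discarded ancestors until a retained one (or the root) is reached.
        while ancestor is not None and ancestor not in observed:
            ancestor = parents.get(ancestor)
        reduced[concept] = ancestor
    return reduced


# Toy hierarchy (assumed to be a tree): "carbon material" never occurs in the corpus.
toy_parents = {
    "material": None,
    "carbon material": "material",
    "fullerene": "carbon material",
}
toy_reduced = reduce_ontology(toy_parents, observed={"material", "fullerene"})
print(toy_reduced)
```

In this toy case, fullerene is re-attached directly under material once the unobserved intermediate concept is discarded.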
Using these reduced ontologies, three domain experts (cf. Acknowledgements) examined their structures and the remaining concepts. Together, they identified five labels as being the core concepts of interest to the field of nanobiology, namely Nanoparticle, Property, Material, Event and Technology. Table 1 presents a short description of each label, along with the core concepts combined under them (the number of their respective sub-concepts before expert selection is indicated between parentheses) and a vocabulary extract in the last column.
The concepts corresponding to each label are taken as the root concepts of ontology sub-trees. We then amalgamated all the terms of each root concept with all the terms of its respective sub-concepts to build the label's vocabulary. In any conflict between the NPO and ENM structures, NPO was preferred. The labels were subsequently subjected to a first detailed verification by the domain experts, who selected only the sub-concepts with relevant vocabulary, which drastically reduced their number. In addition to verifying each label's vocabulary, they encountered six specific cases of terms under Material that they thought should be moved under Nanoparticle: buckyball, carbon dot, surface group, dendrimer, liposome and fullerene.
Table 2 presents the characteristics of the labels' vocabulary. Terms designates the vocabulary size for the label based on the ontologies' lexicon. Vocabulary indicates the size of the extended vocabulary based on the retrieval of terminological variations (cf. below), which includes the original terms. Occurrences gives the raw frequency of all of the label's terms in our corpus. Also, since the vocabulary was obtained by reducing ontologies, the Depth and Width columns give an insight into the shape of each sub-tree.
After the experts determined the labels and corresponding terms, we recorded the variants of all the terms using FASTR (Jacquemin et al., 1997). Given a list of terms and a corpus of texts, FASTR is able to extract the terminological variations using solely lexical, syntactical and meta-grammatical rules. This tool is also able to account for variations in word order and part-of-speech changes. It can deal with multi-word terms and is able to recognise variations in an expression (e.g. 'molecular function' → 'functional roles of molecular'). Although the results of FASTR seemed rather accurate at first, a second round of expert validation of the vocabulary took place. Out of 2,211 unique variations, the experts retained 1,438 terms (i.e. 65%) and thereby preserved the quality of the training data.
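As an illustration of how a label's vocabulary can be gathered from a sub-tree, the sketch below collects the terms of a root concept and of all its descendants. The data layout (a parent mapping plus a mapping from concept to terms) and the function name `label_vocabulary` are assumptions made for the example; the expert filtering described above is not modelled.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set


def label_vocabulary(
    root: str,
    parents: Dict[str, Optional[str]],
    terms: Dict[str, List[str]],
) -> Set[str]:
    """Collect the terms of `root` and of every concept below it."""
    # Invert the child -> parent mapping into parent -> children.
    children = defaultdict(list)
    for concept, parent in parents.items():
        if parent is not None:
            children[parent].append(concept)

    vocabulary: Set[str] = set()
    stack = [root]
    while stack:  # depth-first traversal of the sub-tree
        concept = stack.pop()
        vocabulary.update(terms.get(concept, []))
        stack.extend(children[concept])
    return vocabulary


# Toy sub-tree with hypothetical concept keys and terms.
toy_tree = {"nanoparticle": None, "liposome": "nanoparticle", "dendrimer": "nanoparticle"}
toy_terms = {
    "nanoparticle": ["nanoparticle", "nanoparticles"],
    "liposome": ["liposome"],
    "dendrimer": ["dendrimer"],
}
toy_vocab = label_vocabulary("nanoparticle", toy_tree, toy_terms)
print(sorted(toy_vocab))
```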

Automatic corpus annotation
We annotated the data for our distant supervision approach using Prodigy (Montani and team, 2023) under a research licence. The annotation follows the CoNLL2003 (Sang and Meulder, 2003) standard, which uses the BIO annotation format. The Occurrences column in Table 2 displays the number of annotations under each label in our corpus.
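To make the BIO format concrete, here is a minimal sketch of dictionary-based tagging with greedy longest-match. It does not reproduce Prodigy's behaviour or our exact annotation pipeline: the function `bio_annotate`, the lower-cased exact matching, and the toy vocabulary are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def bio_annotate(tokens: List[str], vocabulary: Dict[Tuple[str, ...], str]) -> List[str]:
    """Tag tokens with B-/I-/O labels using greedy longest-match lookup.

    `vocabulary` maps a tuple of lower-cased tokens to a label name.
    """
    max_len = max((len(term) for term in vocabulary), default=0)
    lowered = [tok.lower() for tok in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first so multi-word terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = vocabulary.get(tuple(lowered[i:i + n]))
            if label is not None:
                match = (n, label)
                break
        if match is None:
            i += 1
            continue
        n, label = match
        tags[i] = f"B-{label}"
        for j in range(i + 1, i + n):
            tags[j] = f"I-{label}"
        i += n
    return tags


toy_vocabulary = {
    ("gold", "nanoparticle"): "Nanoparticle",
    ("nanoparticle",): "Nanoparticle",
    ("toxicity",): "Property",
}
toy_tokens = "The gold nanoparticle reduced toxicity .".split()
toy_tags = bio_annotate(toy_tokens, toy_vocabulary)
print(list(zip(toy_tokens, toy_tags)))
```

The longest-match choice means that 'gold nanoparticle' is tagged as a single two-token entity rather than as an untagged 'gold' followed by 'nanoparticle'.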

Experimental methods
The primary objective of our experiments is to test whether the model possesses good generalization capabilities, precision, and stability. Therefore, we designed three distinct ablation studies to evaluate how dependent our approach is on the labels' vocabulary.

Exploring Existing Models
To identify every entity in the articles, we first examined the results of the SciBERT model (Beltagy et al., 2019). SciBERT is a model pre-trained on a large corpus of scientific articles, aiming at improving the expressivity of the model and saving training time for downstream tasks. We manually annotated 646 "naive" entities related to the field of nanobiology in one article (Ma et al., 2016), "naive" meaning that we only distinguish whether a span is an entity or not, without knowing which label the entity belongs to. We then tried to use SciBERT for "naive" entity recognition on plain text.
The result is that SciBERT can identify almost all entities in the article. Out of 646 entities related to the nanobiology field, it can identify 638, which suggests a high recall (i.e. ≈ 0.99 on the article tested). However, SciBERT identifies a large number of entities that would be false positives in the field of nanobiology. SciBERT identified a total of 2,976 entities, which gives 2,322 false positives that need to be filtered out, suggesting a low precision (i.e. ≈ 0.21). Examples of these false positives are: nanoscience, construction, convergence, reduce, Hayakawa. A way to eliminate these false positives would be to match them with an ontology. But this approach would lack several essential aspects: classification of entities into labels, possible confusion between concepts when trying to do so (e.g. in the ontology, dendrimer is originally present under the concepts Material and Nanoparticle), coverage of the ontology vocabulary, and so on. Moreover, it would not eliminate the need for ontology reduction and expert involvement.
We also experimented with some existing models for BMNER in the Scispacy and Stanza libraries, but most of these are trained on specific corpora and entities, and perform poorly on NER tasks in the nanobiology field.
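The recall and precision quoted for SciBERT can be recomputed from the entity counts given above. The snippet below uses the standard definitions (correctly identified entities divided by annotated entities, and by predicted entities, respectively); the variable names are ours.

```python
# Counts reported for the single test article.
gold_entities = 646         # manually annotated "naive" entities
correct_entities = 638      # annotated entities that SciBERT identified
predicted_entities = 2_976  # all entities returned by SciBERT

recall = correct_entities / gold_entities
precision = correct_entities / predicted_entities
f1 = 2 * precision * recall / (precision + recall)

print(f"recall:    {recall:.2f}")
print(f"precision: {precision:.2f}")
print(f"F1:        {f1:.2f}")
```

The F1 value is not reported in the text; it is included here only to show how strongly the low precision weighs on the combined score.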

Ablation experiment design
In order to assess the dependency of the approach on the resource, as well as the model's generalization capabilities, we designed a set of ablation experiments. As detailed in Table 2, our five labels cover a list of 538 terms. Each term has a varying number of variants, ranging from 1 to over 10, resulting in a total of 1,438 different terms. In our ablation experiments, the terms were first randomly shuffled within each label to minimize the risk of latent factors, such as the terms being arranged in a specific pattern, affecting the experimental results. Then, each label's vocabulary was divided into five equal parts, noted as folds A, B, C, D and E. Ablations of 33% of the terms were also implemented with folds F, G and H.
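The fold construction described above can be sketched as follows, assuming each label's vocabulary is a plain list of terms. The function `make_folds`, the fixed seed, and the toy terms are illustrative choices rather than our actual implementation.

```python
import random
from typing import List, Sequence


def make_folds(terms: Sequence[str], n_folds: int, seed: int = 0) -> List[List[str]]:
    """Shuffle the terms of one label and split them into near-equal folds."""
    shuffled = list(terms)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    # Strided slicing yields folds whose sizes differ by at most one term.
    return [shuffled[i::n_folds] for i in range(n_folds)]


toy_label_terms = [f"term_{i}" for i in range(10)]
folds_20 = make_folds(toy_label_terms, n_folds=5)  # folds A to E (20% each)
folds_33 = make_folds(toy_label_terms, n_folds=3)  # folds F to H (about 33% each)
print([len(fold) for fold in folds_20], [len(fold) for fold in folds_33])
```

Applying the function label by label, rather than on the pooled vocabulary, keeps the proportion of ablated terms the same in every label.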
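To create the training and test sets for our ablation experiments, we selected the sentences based on the presence or absence of ablated entities. This was done in order to ensure that the model would not mistake excluded entities for negative examples during training, and to later test its ability to retrieve entities not encountered before. As shown in Table 3, this resulted in training and test sets of various sizes, some being three times larger than others (e.g. test sets in folds D and C). For instance, the training set in Fold A is composed of 126,834 sentences not containing a single ablated entity.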

Results and Analysis
In this section, we first present the results of training the model on the full dataset, whose performance aligned with our expectations. Then, we detail the two random ablation experiments we designed, reducing the vocabulary by 20% and 33% respectively. We noticed a significant fluctuation in the results of these experiments. Therefore, we specifically analyzed the 20% ablation experiment and, based on this analysis, hypothesized that including or excluding specific terms under our labels might have a significant impact on the precision and recall scores. Following this, we designed a new frequency-based 10% ablation experiment to explore this hypothesis. The results of this new experiment supported our conjecture.

Training on the whole dataset
Initially, we trained the model on the complete dataset, with 5 and 20 training epochs respectively. Given the consistency between the training and validation datasets, this experiment is closer to a straightforward word-matching task. Results are displayed in Table 4. We found that the F1 score of the model reached a value of 0.985, which is consistent with our expectations. Subsequently, we performed an in-depth analysis of the model's generalization ability. Our assumption was that it would be impossible to achieve 100% coverage of the terms in the corpus, so we had the model re-annotate the corpus. Table 4 thus also presents the number of unique new entities (i.e. ignoring the number of occurrences) identified by NanoNER, the number of correctly labeled entities and the associated precision. Not considering the number of occurrences of these newly retrieved entities allows for a better estimation of the model's generalization capabilities. NanoNER achieves a precision on new entities of roughly 0.8. Additionally, we found that as the number of training epochs increased, the precision on the newly found entities improved, but the number of new entities recognized decreased. This indicates that the number of training epochs can be chosen according to the intended use, giving priority to either recall or precision on never-before-encountered entities.
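The precision on new entities is computed over unique entity strings rather than occurrences. A minimal sketch of that computation is given below, assuming the model's predictions are available as a list of surface forms and that a set of entities judged correct is available; `new_entity_precision` and both input sets are hypothetical names introduced for the example.

```python
from typing import Iterable, Set, Tuple


def new_entity_precision(
    predicted: Iterable[str], known_vocabulary: Set[str], validated: Set[str]
) -> Tuple[Set[str], float]:
    """Return the unique new entities and the share of them judged correct.

    `known_vocabulary` and `validated` are assumed to be lower-cased.
    """
    # Deduplicate predictions so that frequent entities count only once.
    new_entities = {p.lower() for p in predicted} - known_vocabulary
    if not new_entities:
        return new_entities, 0.0
    correct = new_entities & validated
    return new_entities, len(correct) / len(new_entities)


toy_predictions = ["Gold nanorod", "gold nanorod", "liposome", "Figure"]
toy_new, toy_precision = new_entity_precision(
    toy_predictions, known_vocabulary={"liposome"}, validated={"gold nanorod"}
)
print(sorted(toy_new), toy_precision)
```

Here 'gold nanorod' is counted once despite occurring twice, and 'liposome' is ignored because it already belongs to the annotation vocabulary.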

20% Ablation experiments
The primary objective of our ablation study is to further test the model's generalization capabilities and its dependency on the resource. We employed early stopping for training, setting the number of epochs to 1 and the batch size to 32. The training results are presented in Table 5. As we can see, the precision fluctuates around 0.79, and the gap between the highest and lowest precision values can be as high as 0.14. The impact of the vocabulary ablation is even more visible on the recall: with an average score of 0.54, it has degraded considerably compared to the initial training. This tends to indicate that the absence or presence of specific terms strongly influences the quality of the trained model. To address this, we conducted further exploration in Section 5.4.1.

33% Ablation experiments
Next, we conducted experiments with a 33% ablation of the terms.The results in Table 6 are as expected: the precision value remained around 0.8, but the recall rate dropped even further.This is due to the higher number of terms excluded in 33% ablation experiments compared to the 20% ones.

Ablation experiments analysis
Firstly, we conducted a generalization analysis on the ablation experiments from folds A to E. We employed the same method as in the full data analysis: we initially listed all the deleted entities, and then had the model make predictions on the entire corpus. Subsequently, we matched all the entities predicted by the model with the deleted entities. Next, we sought to analyze the variability between the folds. Hypothesizing that certain labels are excessively difficult, thereby affecting the overall performance on the task, we evaluated each label separately. Table 8 displays the average recall and precision values of the labels over the different folds (a detailed evaluation is available in Appendix A). We found that Nanoparticle and Event have the largest variations in recall, while the variations in precision mostly concern Nanoparticle and Technique. Most of the average recall values over the different labels are close to 0.54, but the differences in average precision are larger when comparing the different labels. Material and Property display scores over 0.90, Nanoparticle is around 0.83, but Event and Technique have significantly lower precision values (0.74 and 0.65 respectively). We attribute these variations to the specific characteristics of the labels' vocabularies. For example, Event and Technique contain terms from the scientific language that are not specific to the nanobiology field, and thus carry a higher risk of confusion. We also observed specific differences between the different folds for specific labels. For example, the recall for Nanoparticle is very high in fold A (i.e. 0.82), but significantly lower in fold E (i.e. 0.19). This suggests that certain words are highly important for specific labels. In fact, in fold E, the term nanoparticle was removed, which is not only a high-frequency term throughout the entire corpus, but also an essential word involved in different terms (e.g. nanoparticle, gold nanoparticle).

We believe that these high-frequency terms may have a great impact on the training of the model. Therefore, removing these words during training might lead to a significant decline in the performance of the model. To explore this hypothesis, we decided to conduct a third ablation experiment based on term frequency.

10% Ablation experiments
We then sorted the terms according to their frequency in the corpus and conducted four more rounds of ablation, as shown in Table 9: in each label, we tried removing the 10% most frequent terms (with one experiment reintroducing the first term in each label), removing the 10% of terms in the middle of the frequency ranking, and finally retaining only the 10% most frequent terms. This approach limits the part of the corpus that can be used as training and test sets, reflecting the distribution of the terms throughout the corpus.
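The four frequency-based settings can be sketched as a function returning, for one label, the set of terms ablated in each round. The exact cut-off positions (in particular where the "middle" 10% starts) are assumptions made for the example, as are the function name `frequency_ablations` and the setting names.

```python
from typing import Dict, Set


def frequency_ablations(term_counts: Dict[str, int], fraction: float = 0.10) -> Dict[str, Set[str]]:
    """Return the terms removed in each frequency-based ablation round."""
    # Rank terms from most to least frequent (ties broken alphabetically).
    ranked = [t for t, _ in sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    k = max(1, round(len(ranked) * fraction))
    mid = (len(ranked) - k) // 2  # assumed start of the "middle" slice
    return {
        "remove_top": set(ranked[:k]),
        "remove_top_keep_first": set(ranked[1:k]),  # most frequent term reintroduced
        "remove_middle": set(ranked[mid:mid + k]),
        "keep_top_only": set(ranked[k:]),           # everything but the top slice is ablated
    }


# Toy label with 20 terms whose counts decrease from 20 to 1.
toy_counts = {f"t{i:02d}": 20 - i for i in range(20)}
toy_rounds = frequency_ablations(toy_counts)
for name, removed in toy_rounds.items():
    print(name, sorted(removed))
```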
Firstly, it appears that the frequency of the terms is a good indicator of the model dependency towards the annotation resource.Indeed, precision and recall values are impacted proportionally to the rank of the terms removed.Comparing the first and second rows also indicates that certain terms (i.e. the most frequent terms in each label) significantly impact the model's performance.These terms may play a critical role in the classification task, or they could provide substantial contextual information, helping the model understand other related terms.Regarding the third row's ablation experiment, although most of the corpus (99%) was kept to train the model, the F1 score is far lower than 0.985 when trained with the full data.This suggests that even terms with lower frequencies still significantly impact the model's performance.These less frequent terms might carry specific information crucial for the model to understand and classify the text.The fourth row's ablation experiment result indicates that retaining only the most common terms might lead the model to overly focus on these terms, overlooking other terms that may carry important information.This could be because these common terms contain a lot of generic information but lack some specific, category-targeted information.

Error analysis and improvement approaches

Sentence selection during ablation experiments
In our ablation experiments, we sometimes encounter a scenario where a sentence contains both an ablated term and a term that is not ablated, so that we would like to remove only one of them. In our experiments, we chose to exclude such sentences to avoid having our model mistake ablated terms for negative examples. Other strategies could be adopted to tackle this, such as masking or replacing tokens.
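The selection rule can be illustrated with a short sketch. It assumes each sentence comes with the set of vocabulary terms found in it, that sentences without any ablated term form the training set, that sentences containing only ablated terms form the test set, and that mixed sentences are dropped from both. The last point is our simplifying assumption for the example, and `split_for_ablation` is a hypothetical helper.

```python
from typing import List, Set, Tuple

Sentence = Tuple[str, Set[str]]  # (text, vocabulary terms found in the text)


def split_for_ablation(
    sentences: List[Sentence], ablated_terms: Set[str]
) -> Tuple[List[str], List[str], List[str]]:
    """Split sentences into training, test and excluded sets."""
    train, test, excluded = [], [], []
    for text, terms in sentences:
        has_ablated = bool(terms & ablated_terms)
        has_retained = bool(terms - ablated_terms)
        if has_ablated and has_retained:
            excluded.append(text)  # mixed sentence: dropped (assumption)
        elif has_ablated:
            test.append(text)      # only unseen entities
        else:
            train.append(text)     # no ablated entity at all
    return train, test, excluded


toy_sentences = [
    ("Liposomes were prepared.", {"liposome"}),
    ("Dendrimer toxicity was measured.", {"dendrimer", "toxicity"}),
    ("Toxicity increased.", {"toxicity"}),
    ("No entity here.", set()),
]
toy_train, toy_test, toy_excluded = split_for_ablation(toy_sentences, {"dendrimer", "liposome"})
print(len(toy_train), len(toy_test), len(toy_excluded))
```

A masking strategy would instead keep the mixed sentence in the training set and hide only the ablated span, at the cost of altering the sentence.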

Imbalanced dataset
As observed during the ablation experiments, our corpus is highly imbalanced. Some terms and labels appear more frequently than others. Detailed evaluations of individual labels in the ablation experiments are given in Appendix A. This discrepancy might lead the model to over-learn from these high-frequency terms, thereby overlooking the importance of less frequent terms. In future optimization, this problem could be tackled by using techniques such as oversampling or undersampling to balance the number of samples across the different categories. In a study on NER using Wikipedia, Al-Rfou et al. (2015) adopted an approach that involves constructing a subset of the training corpus. This strategy ensures that the conditional distribution of specific entity classes remains unaltered when they are positive examples, thereby significantly enhancing the model's performance across multiple languages.

Vocabulary coverage in distant supervision
Our training and evaluation assume that the ground truth annotations are accurate, which may not be the case in a distant supervision framework. Thus, there may be cases where the model retrieves entities under the correct label, but these are considered false positives in our automatically annotated corpus. We tried to reduce this effect by employing FASTR (Jacquemin et al., 1997) to improve the coverage of our resource, but this required experts to filter out FASTR's false positives. Also, there are some known variations that FASTR is not able to retrieve (e.g. 'iron oxide nanoparticle' and 'silicon dot' are in the Nanoparticle vocabulary, but their respective variations 'iron nanoparticle' and 'silicon dot' were not recognized). One solution would be to use a method known as knowledge distillation, which incrementally improves the model's performance through iterations, a training method used in the previously mentioned BOND (Liang et al., 2020) paper. By using a teacher model to generate pseudo-labels for training the student model, the training effectiveness of the model is improved through repeated iterations. Another solution is to manually annotate a sufficient portion of high-quality data and then use it as a validation set for the model.

Conclusion
In this work, we have introduced NanoNER, a tool for Named Entity Recognition in the field of nanobiology.We designed an iterative process to determine the model labels and vocabulary using ontologies, domain experts and retrieving terminological variations.This resulted in five labels, covering 1,438 terms, that allow for the automatic annotation of our corpus in a distant supervision approach.Experiment analyses have demonstrated that our model can effectively identify entities of interest, both previously seen and new ones, in the field of nanobiology.Given the complexity and abundance of technical terms in the field, our method shows promising applications in nanobiology.
We believe that this approach can be applied as is to other scientific fields, as it requires only an ontology (or taxonomy) resource and minimal manpower. This allows for the efficient training of NER models useful in downstream NLP tasks.
Ablation experiments showed a significant dependence of the model on the vocabulary used. In future work, we could attempt data augmentation on the dataset to reduce its imbalance and enhance the model's training performance. In addition, it is possible to use knowledge distillation for iterative model updates, which can reduce the number of entities wrongly counted as false positives during validation and improve the model's generalization capabilities.

Table 1 :
Description and examples of the chosen labels

Table 2 :
Labels vocabulary sizes

Table 3 :
Number of sentences in each Fold

Table 4 :
Table 7, the model could stably Models trained using the entire dataset

Table 8 :
Average performances on each label

Table 9 :
Models trained on 10% ablation

Table 10 :
Ablation evaluation on individual labels