ChemNER: Fine-Grained Chemistry Named Entity Recognition with Ontology-Guided Distant Supervision

Scientific literature analysis needs fine-grained named entity recognition (NER) to provide a wide range of information for scientific discovery. For example, chemistry research needs to study dozens to hundreds of distinct, fine-grained entity types, making consistent and accurate annotation difficult even for crowds of domain experts. On the other hand, domain-specific ontologies and knowledge bases (KBs) can be easily accessed, constructed, or integrated, which makes distant supervision realistic for fine-grained chemistry NER. In distant supervision, training labels are generated by matching mentions in a document with the concepts in the KBs. However, this kind of KB-matching suffers from two major challenges: incomplete annotation and noisy annotation. We propose ChemNER, an ontology-guided, distantly-supervised method for fine-grained chemistry NER to tackle these challenges. It leverages the chemistry type ontology structure to generate distant labels with novel methods of flexible KB-matching and ontology-guided multi-type disambiguation. It significantly improves the distant label generation for the subsequent sequence labeling model training. We also provide an expert-labeled chemistry NER dataset with 62 fine-grained chemistry types (e.g., chemical compounds and chemical reactions). Experimental results show that ChemNER is highly effective, substantially outperforming the state-of-the-art NER methods (with a 0.25 absolute F1 score improvement).

In real-world applications, it is important to recognize chemistry entities of diverse and fine-grained types (e.g., "inorganic phosphorus compounds", "coupling reactions" and "catalysts") to provide a wide range of information for scientific discovery. This requires dozens to hundreds of distinct types, making consistent and accurate annotation difficult even for domain experts. On the other hand, domain-specific ontologies and knowledge bases (KBs) can be easily accessed, constructed, or integrated, which makes distant supervision realistic for fine-grained chemistry NER.
Still, challenges exist for correctly recognizing the entity boundaries and accurately typing entities with distant supervision. In distant supervision, training labels are generated by matching the mentions in a document with the concepts in the KBs. However, this kind of KB-matching suffers from two major challenges: (1) incomplete annotation, where a mention in a document is matched only partially or missed completely due to the incomplete coverage of the KBs (Figure 1a), and (2) noisy annotation, where a mention is erroneously matched because it matches multiple entity types in the KBs (Figure 1b). Due to the complex name structures (e.g., nested naming structures and long chemical formulas) of chemical entities, these challenges lead to severely low precision and low recall for fine-grained chemistry NER with distant supervision.
[Figure 1: Examples of (a) incomplete annotation and (b) noisy annotation with KB-matching.]
Several studies have attempted to address the incomplete annotation problem in distantly-supervised NER. For example, AutoNER (Shang et al., 2018b) introduces an "unknown" type that can be skipped during training to reduce the effect of false negative labeling with distant supervision. BOND (Liang et al., 2020) leverages the power of pre-trained language models and a self-training approach to iteratively incorporate more training labels and improve the NER performance. However, previous methods assume a high precision and reasonable coverage of KB-matching for distant label generation. For example, KB-matching on the CoNLL03 dataset (Liang et al., 2020) is reported to achieve over 80% precision and over 60% recall. These methods do not work well for fine-grained chemistry NER, where KB-matching has severely low precision and low recall. Previous studies also largely ignore the noisy annotation problem by simply discarding the multi-labels during the KB-matching process (Liang et al., 2020). However, the noisy labels cannot be simply ignored for chemistry entities because they constitute a large portion of the distant training labels. We observe that more than 60% of the entities have multiple labels during KB-matching in the chemistry domain.
We propose CHEMNER, an ontology-guided, distantly-supervised NER method for fine-grained chemistry NER. Taking an input corpus, a chemistry type ontology and associated entity dictionaries collected from the KBs, we develop a novel flexible KB-matching method with TF-IDF-based majority voting to resolve the incomplete annotation problem. Then we develop a novel ontology-guided multi-type disambiguation method to resolve the noisy annotation problem. Taking the output of the above two steps as distant supervision, we further train a sequence labeling model to cover additional entities. CHEMNER significantly improves the distant label generation for the subsequent NER model training. We also provide an expert-labeled chemistry NER dataset with 62 fine-grained chemistry types (e.g., chemical compounds and chemical reactions). Experimental results show that CHEMNER is highly effective, achieving substantially better performance (with a 0.25 absolute F1 score improvement) than the state-of-the-art NER methods. We have released our data and code to benefit future studies.

Related Work
Distantly-Supervised NER. Aiming to reduce expensive manual annotation, distant supervision has been used to generate training labels automatically by utilizing the entity information from existing KBs. The major research efforts lie in dealing with the incomplete annotation problem caused by the incomplete coverage of the KBs (Fries et al., 2017; Shang et al., 2018b; Peng et al., 2019; Wang et al., 2019a, 2020a; Liang et al., 2020).
AutoNER (Shang et al., 2018b) proposes a "tie-or-break" tagging scheme to leverage distant supervision from entity dictionaries. Compared with the traditional "BIOES" tagging scheme, the "tie-or-break" tagging scheme introduces an "unknown" type that can be skipped during training to reduce the effect of false negative labeling brought by the incomplete KB-matching. However, the phrase mining method AutoPhrase (Shang et al., 2018a), which is used to generate the "unknown" entities, often misses low-frequency phrases. Positive and unlabeled learning (PU-learning) is used in distantly-supervised NER to provide an unbiased and consistent estimator of the objective function (Peng et al., 2019). However, there are two limitations in using PU-learning for distantly-supervised NER. First, PU-learning uses the prior distribution for each entity type, a parameter that is estimated from an existing human-annotated test set that is not always available for new entity types. Second, the performance of PU-learning is highly sensitive to the class-imbalance rate for each entity type, a parameter that is heuristically determined. Due to these two limitations, it is difficult to apply PU-learning to distantly-supervised NER tasks on new entity types in new domains. BOND (Liang et al., 2020) leverages the power of pre-trained language models (e.g., BERT and RoBERTa) and a self-training approach to iteratively incorporate more training labels and improve the NER performance. However, these methods do not work well with fine-grained chemistry entities, for which KB-matching has severely low precision and low recall. They also largely ignore the noisy annotation problem by simply discarding the multi-labels during the KB-matching process.
[Figure 2: Overview of the CHEMNER framework, illustrated with example sentences. S1: [Methyl-14C]S-dThd was synthesized by rapid methylation of ... S2: ... Suzuki-Miyaura cross-coupling reactions were carried out ... S3: Although it was necessary to employ a stoichiometric quantity of palladium, it is noteworthy that the cross-coupling proceeded in the presence of a wide array of functional groups. S4: ... can undergo a transmetalation with either BBA or the rapidly forming boronic acid ...]
Other Related Tasks. One task similar to fine-grained NER is entity linking (Francis-Landau et al., 2016; Gupta et al., 2017; Raiman and Raiman, 2018; Le and Titov, 2018), which maps a candidate entity in the text to a concept identifier in the knowledge bases. However, entity linking cannot deal with new entities that do not exist in the background knowledge bases. Another similar task is fine-grained entity typing (FET) (Hoffart et al., 2011; Yosef et al., 2012; Ling and Weld, 2012; Del Corro et al., 2015; Ren et al., 2015; Choi et al., 2018), which has been extensively studied in the general domain. FET aims at classifying an entity mention into a wide range of entity types by disambiguating the pre-identified entity mentions into a set of candidate entity types. It is formulated as a multi-class, multi-label classification problem and does not assume type exclusiveness. In contrast, the fine-grained NER task targets both entity boundary detection and entity type recognition and assumes each entity to be tagged with only one type in a given context. In this study, we focus on the fine-grained NER task in the chemistry domain.

The CHEMNER Framework
We propose CHEMNER, an ontology-guided, distantly-supervised NER method for fine-grained chemistry NER (Figure 2). It includes distant label generation (entity span detection, flexible KB-matching, and ontology-guided multi-type disambiguation) and sequence labeling model training.

Data Preparation
The input to CHEMNER includes two parts: (1) a chemistry literature corpus, and (2) a chemistry type ontology with associated entity dictionaries for each type. We collect the type ontology from Wikipedia by treating category pages as types and the titles of the pages associated with each category as the entity dictionary for each type. We further remove irrelevant types and merge some fine-grained types into their coarse-grained parent types based on their term frequencies in the corpus. We also expand the entity dictionaries with synonyms collected from the PubChem knowledge base. Finally, we obtain a fine-grained chemistry entity type ontology with 62 types and its associated dictionaries with 10,551 entities. Figure 3 shows a subset of our chemistry type ontology. The complete fine-grained chemistry type ontology with 62 types can be found in Appendix A.1.
[Figure 3: Illustration of the chemistry type ontology construction and dictionary collection.]

Flexible KB-Matching
Taking the input corpus, chemistry type ontology and associated entity dictionaries collected from the KBs, we first develop a flexible KB-matching method to resolve the incomplete annotation problem. Chemistry entities usually have complex name structures, such as nested naming structures (e.g., "aryl chloride", where "aryl" is a FUNCTIONAL GROUP and "chloride" is a HALIDE, but the whole phrase is an ORGANOHALIDE) and long chemical formulas (e.g., "Methyl 3'-(((Trifluoromethyl)sulfonyl)oxy)-[1,1'-biphenyl]-4-carboxylate"), that are quite flexible and cannot be fully covered by the KBs. Simple KB-matching used in previous distantly-supervised NER methods (Shang et al., 2018b; Liang et al., 2020) cannot match those complex chemistry entities that do not exist in the KBs, which leads to severely low precision and low recall for labeling the fine-grained chemistry entities. We propose to first conduct entity span detection with chemistry phrase chunking tools, followed by flexible KB-matching, to resolve the incomplete KB-matching problem. We use two phrase chunking tools, ChemDataExtractor (Swain and Cole, 2016) and Genia Tagger (Tsuruoka and Tsujii, 2005), to generate candidate entity spans in the input corpus (e.g., in Figure 2 sentence S2, the phrase chunking tools find "Suzuki-Miyaura cross-coupling reactions" as a candidate entity span). Based on the detected candidate entity spans, we develop a flexible KB-matching method with TF-IDF-based majority voting to resolve the incomplete annotation problem.
The flexible KB-matching method can match long and complex chemistry entities (e.g., chemical compounds) that do not exist in the KBs. Specifically, we label each candidate entity span by letting each word token in the entity span vote for several entity types that are most likely to involve this word token. For example, in Figure 2 sentence S1, "[Methyl-14C]S-Thd", which is short for "4'-[methyl-14C]thiothymidine" according to the original document, is an author-defined abbreviation that cannot be covered by the existing KBs. However, since "Methyl-" is a common functional group that is usually the prefix of organic compounds, this word token in "[Methyl-14C]S-Thd" helps vote for the types "ORGANIC COMPOUNDS" and "ORGANIC POLYMERS". Another example is sentence S2, where three ("suzuki", "coupling", "reaction") out of the five word tokens in "Suzuki-Miyaura cross-coupling reactions" help vote for the type "COUPLING REACTIONS".
Formally, let e = [w_1, w_2, ..., w_n], w_i ∈ V, where e denotes a candidate entity span, w_i a word token in the entity span, and V the vocabulary. Let T denote the set of fine-grained types and D_t the dictionary of entities for type t ∈ T. The TF-IDF score of each word token w for each entity type t ∈ T is calculated from f(w, D_t), the frequency of the word token w in the dictionary D_t. We set a minimum TF-IDF threshold θ = 0.02 to eliminate common words from voting for the entity types. Then we let each word token vote for the several entity types that have the highest TF-IDF scores above the minimum TF-IDF threshold, and generate the distant labels by majority voting. Note that this step can generate multi-type labels for the candidate entity spans due to ties in the majority voting. We resolve this problem with an ontology-guided multi-type disambiguation method as the next step.
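The voting procedure can be illustrated with a minimal sketch. It assumes a standard TF-IDF form (the relative frequency of a token within D_t, multiplied by the log of the inverse fraction of type dictionaries that contain the token), which may differ from the exact formula used by CHEMNER. The function names, the top_k parameter, and the toy dictionaries are hypothetical and serve illustration only; tokens are assumed to be already lowercased and normalized.

```python
import math
from collections import Counter

# Toy entity dictionaries (hypothetical, for illustration only).
TOY_DICTIONARIES = {
    "COUPLING REACTIONS": ["suzuki reaction", "stille coupling"],
    "FUNCTIONAL GROUPS": ["aryl group", "vinyl group"],
    "HALIDES": ["sodium chloride", "silver chloride"],
}

def build_tfidf(dictionaries):
    """Return {type: {token: score}} under the assumed TF-IDF form."""
    # f(w, D_t): frequency of each word token in each type dictionary.
    counts = {t: Counter(tok for ent in ents for tok in ent.lower().split())
              for t, ents in dictionaries.items()}
    n_types = len(counts)
    # Number of type dictionaries that contain each token.
    df = Counter(tok for c in counts.values() for tok in c)
    scores = {}
    for t, c in counts.items():
        total = sum(c.values())
        scores[t] = {tok: (f / total) * math.log(n_types / df[tok])
                     for tok, f in c.items()}
    return scores

def vote(span_tokens, scores, theta=0.02, top_k=2):
    """Each token votes for its top_k types above theta; return all tied winners."""
    votes = Counter()
    for tok in span_tokens:
        cands = [(s[tok], t) for t, s in scores.items() if s.get(tok, 0.0) > theta]
        for _, t in sorted(cands, reverse=True)[:top_k]:
            votes[t] += 1
    if not votes:
        return set()  # no token carries a signal above the threshold
    best = max(votes.values())
    return {t for t, v in votes.items() if v == best}
```

With the toy dictionaries, the tokens of "Suzuki-Miyaura cross-coupling reaction" vote for a single type, whereas "aryl chloride" yields a tie between two types, which is the multi-type case handed to the disambiguation step.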

Ontology-Guided Multi-Type Disambiguation
Based on the output of flexible KB-matching and the chemistry type ontology structure, we develop an ontology-guided multi-type disambiguation method to resolve the noisy annotation problem. An intuition behind multi-type disambiguation is that the entities in the same sentence, paragraph or document usually follow a focused topic. For example, if a sentence is talking about organic chemistry, the entities in this sentence are more likely to have types related to organic chemistry. Following this intuition and the chemistry type ontology structure (Section 3.1), we draw two insights for automated multi-type disambiguation: (1) the entity types in one sentence are usually confined to one big branch of the chemistry type ontology (e.g., organic or inorganic chemistry), and (2) the type of an entity under local context should be close, on the chemistry type ontology, to the types of the surrounding entities in the same sentence. For example, in Figure 2, sentence S3 contains one entity "palladium" that has two candidate types: "CATALYSTS", which falls under "CHEMICAL REACTIONS", and "TRANSITION METALS", which falls under "CHEMICAL ELEMENTS". By looking at its surrounding entities (e.g., "cross-coupling"), we see that the surrounding entity types (e.g., "COUPLING REACTIONS" for "cross-coupling") fall under the "ORGANIC REACTIONS" branch, which is also under the larger "CHEMICAL REACTIONS" branch of the type ontology. So sentence S3 is likely talking about chemical reactions, and based on the local context "CATALYSTS" is a more suitable type for "palladium" than "TRANSITION METALS". Formally, let s = [e_1, e_2, ..., e_n], where s denotes a sentence and e_i the i-th entity mention in it, which has been assigned an initial label set T_{e_i} = {t^1_{e_i}, ..., t^m_{e_i}} by flexible KB-matching.
For an entity e_i with multiple candidate types (|T_{e_i}| > 1) to be resolved, we calculate the inverse distance between each candidate type and the distribution of the surrounding types on the type ontology. The disambiguation score S_d(t^j_{e_i}) for each candidate type is defined in terms of lca(·, ·), the lowest common ancestor of two types on the type ontology, and dep(·), the depth of a type on the type ontology. S_d(t^j_{e_i}) ∈ (0, 1), and a larger score indicates that the candidate type t^j_{e_i} is more likely to be the correct type for the entity e_i in sentence s.
If the surrounding types in the sentence still produce ties in the candidate type resolution, we could further enlarge the scope to a few surrounding sentences, the paragraph, the document or the corpus. We introduce a corpus-level global popularity score for each type based on our experimental observations. As shown in Figure 2, we calculate the frequency of each type in our corpus initially labeled with flexible KB-matching. "CATALYSTS" is globally more popular, with a frequency of 18,707, than "TRANSITION METALS", with a frequency of 9,618. The global popularity score S_g(t^j_{e_i}) for each candidate type is defined in terms of f_c(·), the frequency of the type in the flexible KB-matched corpus. S_g(t^j_{e_i}) ∈ (0, 1], and a larger score indicates that the candidate type t^j_{e_i} is more likely to be the correct type for the entity e_i globally in the corpus. The final score S(t^j_{e_i}) of the candidate type t^j_{e_i} is a combination of the local disambiguation score S_d(t^j_{e_i}) and the global popularity score S_g(t^j_{e_i}). For multi-type disambiguation, we choose the type t^j_{e_i} with the highest score S(t^j_{e_i}) for the entity e_i.
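The disambiguation step can be sketched as follows. The score forms here are assumptions made for illustration and are not necessarily the ones used by CHEMNER: the local score is taken to be the mean of 2·dep(lca(a, b)) / (dep(a) + dep(b)) over the surrounding types, the global score is the type frequency normalized by the most frequent type, and the two are combined lexicographically so that the global score only breaks local ties. The ontology fragment follows the "palladium" example in the text, with a hypothetical root node "CHEMISTRY".

```python
# Child -> parent links of a small ontology fragment (root "CHEMISTRY" is assumed).
PARENT = {
    "CHEMICAL REACTIONS": "CHEMISTRY",
    "CHEMICAL ELEMENTS": "CHEMISTRY",
    "ORGANIC REACTIONS": "CHEMICAL REACTIONS",
    "COUPLING REACTIONS": "ORGANIC REACTIONS",
    "CATALYSTS": "CHEMICAL REACTIONS",
    "TRANSITION METALS": "CHEMICAL ELEMENTS",
}
# Type frequencies quoted in the text for the flexible KB-matched corpus.
TYPE_FREQ = {"CATALYSTS": 18707, "TRANSITION METALS": 9618}

def ancestors(t, parent):
    """Path from type t up to the root, inclusive."""
    path = [t]
    while t in parent:
        t = parent[t]
        path.append(t)
    return path

def depth(t, parent):
    return len(ancestors(t, parent))  # the root has depth 1

def lca(a, b, parent):
    """Lowest common ancestor; assumes a single-rooted tree."""
    seen = set(ancestors(a, parent))
    return next(node for node in ancestors(b, parent) if node in seen)

def local_score(cand, context_types, parent):
    """Assumed form: mean depth-normalized closeness to the surrounding types."""
    if not context_types:
        return 0.0
    sims = [2 * depth(lca(cand, c, parent), parent)
            / (depth(cand, parent) + depth(c, parent)) for c in context_types]
    return sum(sims) / len(sims)

def global_score(cand, type_freq):
    """Assumed form: frequency relative to the most frequent type."""
    return type_freq.get(cand, 0) / max(type_freq.values())

def disambiguate(cands, context_types, parent, type_freq):
    # Assumed combination: local score first, global popularity breaks ties.
    return max(cands, key=lambda t: (local_score(t, context_types, parent),
                                     global_score(t, type_freq)))
```

For "palladium" with the surrounding type "COUPLING REACTIONS", the sketch prefers "CATALYSTS" because its lowest common ancestor with the context type ("CHEMICAL REACTIONS") lies deeper in the ontology than that of "TRANSITION METALS" (the root).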

Sequence Labeling Models
The flexible KB-matching and multi-type disambiguation still rely on the signals from the KBs and ontologies, which cannot cover all the new entities in the corpus. Taking the output of the above two steps as distant supervision, we further train a sequence labeling model to solve the label sparsity problem. For example, in Figure 2 sentence S4, "BBA" is a new entity that cannot be labeled by flexible KB-matching since there are no obvious token-level signals. However, there is a "boronic acid" entity with the type "OXOACIDS" in its surrounding context. The sequence labeling model is able to capture context patterns such as "either ... or ...", which usually connect entities with similar types. Thus it is likely to recognize "BBA" with the type "OXOACIDS". Based on the distant labels generated by the flexible KB-matching and multi-type disambiguation, we train a sequence labeling model (e.g., RoBERTa, ChemBERTa) without any constraints on the type of model to use. The loss function is defined over h_θ(·), the output of the sequence labeling model, and y, our generated distant label. Minimizing it is equivalent to minimizing the cross-entropy error between the outputs of the sequence labeling model and our generated distant labels.
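As a small worked illustration of the training objective, the sketch below computes a token-level cross-entropy between model outputs and distant labels. It assumes the model outputs unnormalized per-token logits over the label set and that the loss is averaged over tokens; the function names are hypothetical, and a real implementation would use the loss provided by the chosen deep learning framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_cross_entropy(logits_per_token, distant_labels):
    """Mean of -log p(y) over tokens, where y is the distant label index."""
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(logits_per_token, distant_labels)]
    return sum(losses) / len(losses)
```

A uniform prediction over three labels gives a loss of log 3, and the loss decreases as the model assigns more probability to the distant label.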

Dataset
We provide a chemistry NER dataset covering 62 fine-grained chemistry types such as chemical compounds and chemical reactions. This dataset can be used to benchmark distantly supervised NER methods for the fine-grained chemistry NER task. The input for training includes two parts: (1) a chemistry literature corpus with 69,806 unlabeled sentences, and (2) a chemistry type ontology with 62 fine-grained chemistry types and associated entity dictionaries for each type (Section 3.1). The test set contains 1,600 expert-annotated sentences on the fine-grained chemistry types. We use this test set to compare the performance of different NER methods in our experiments. We report the entity-level micro-precision, micro-recall, and micro-F1 scores of each NER method on the human-annotated test set. More details of the dataset preparation can be found in Appendix A.1.
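The entity-level micro-averaged metrics can be computed as in the following sketch. It assumes exact matching, i.e., a predicted entity counts as correct only if both its boundaries and its type agree with a gold entity; the function name and the (start, end, type) representation are hypothetical.

```python
def micro_prf(gold, pred):
    """gold, pred: one set of (start, end, type) tuples per sentence."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # exact span-and-type matches
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because the counts are pooled over all sentences before dividing, frequent types contribute more to the scores than rare ones.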

Baselines
We compare the performance of CHEMNER with different groups of baseline methods. More details of the parameter settings and runtime analysis of each model can be found in Appendix A.2.
KB-Matching: This baseline is a simple string matching method following Peng et al. (2019). It is a greedy search algorithm that walks through a sentence trying to find the longest strings that match the entities in the dictionaries. For the strings matched with multiple types, we simply discard those multi-labels, following Liang et al. (2020).
KB-Matching (freq): This baseline is a simple improvement of KB-Matching. For the strings matched with multiple types, we choose the type that has the highest frequency in the corpus.
BiLSTM-CRF: This baseline is the BiLSTM-CRF model (Ma and Hovy, 2016) that takes the results of KB-Matching (freq) as distant supervision.
AutoNER: This baseline is the AutoNER model (Shang et al., 2018b) that directly takes the raw corpus and the dictionaries as the input. It has a built-in KB-matching algorithm that maximizes the total number of matched tokens in each sentence to generate distant supervision. For the strings matched with multiple types, it assigns equal probabilities to each candidate type during training.
RoBERTa: This baseline is the RoBERTa model (Liu et al., 2019) that takes the results of KB-Matching (freq) as distant supervision.
ChemBERTa: This baseline is the ChemBERTa model (Chithrananda et al., 2020) that takes the results of KB-Matching (freq) as distant supervision. The ChemBERTa language model is pre-trained on the SMILES strings of chemical molecule structures instead of a chemistry corpus. To our knowledge, there is no domain-specific language model pre-trained on a chemistry corpus.
BOND: This baseline is the BOND model (Liang et al., 2020) that takes the results of KB-Matching (freq) as distant supervision. According to the BOND paper, the original distant supervision corresponds to our KB-Matching baseline. Here we use the improved KB-Matching (freq) baseline to give the BOND baseline an improved performance.
CHEMNER_F: This is an ablation model of CHEMNER with the flexible KB-Matching only. For the strings matched with multiple types, we simply discard those multi-labels.
CHEMNER_FM: This is an ablation model of CHEMNER with the flexible KB-Matching and the ontology-guided multi-type resolution.
CHEMNER_BiLSTM-CRF: This is a variation of CHEMNER that takes the results of CHEMNER_FM as distant supervision and trains a BiLSTM-CRF model for the final prediction.
CHEMNER_RoBERTa: This is a variation of CHEMNER that takes the results of CHEMNER_FM as distant supervision and trains a RoBERTa model for the final prediction.
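The KB-Matching and KB-Matching (freq) baselines can be sketched as a greedy longest-match search. This is an illustration under assumptions, not the baselines' actual code: the dictionary is assumed to map lowercased token tuples to sets of types, the max_len cap on span length is a hypothetical parameter, and the type "ACIDS" in the usage example below is made up.

```python
def kb_match(tokens, dictionary, type_freq=None, max_len=5):
    """Greedy longest-match labeling; returns (start, end, type) spans.

    With type_freq=None, strings matched with multiple types are discarded
    (KB-Matching); otherwise the most frequent type is chosen
    (KB-Matching (freq)).
    """
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate string starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            types = dictionary.get(tuple(t.lower() for t in tokens[i:i + n]))
            if not types:
                continue
            if len(types) == 1:
                spans.append((i, i + n, next(iter(types))))
            elif type_freq is not None:
                best = max(types, key=lambda t: type_freq.get(t, 0))
                spans.append((i, i + n, best))
            i += n  # skip past the matched string
            break
        else:
            i += 1  # no dictionary entry starts at this token
    return spans
```

For example, with entries for ("boronic", "acid"), ("acid",) and a two-type ("palladium",), the sketch labels "boronic acid" as one span rather than labeling "acid" alone, and labels "palladium" only when type frequencies are supplied.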

Experimental Results
Overall Results. Table 1 shows the overall results on the test set of our fine-grained chemistry NER dataset. CHEMNER achieves a 0.25 absolute F1 score improvement over the best-performing baseline model, RoBERTa. As we have discussed, the KB-Matching method suffers from severely low precision (32%) and low recall (5%) for labeling the fine-grained chemistry entities, which greatly limits the performance of the baseline NER methods that use KB-Matching for distant supervision.
Ablation Study. [...]tion, for fine-grained chemistry NER under distant supervision. The four full model variations further show that RoBERTa is the best sequence labeling model that takes the output of CHEMNER_FM as distant supervision.
Parameter Study. Table 3 shows the effect of different minimum TF-IDF thresholds θ on the performance of CHEMNER_F. This threshold θ is used to eliminate common word tokens from voting for the candidate entity types during the flexible KB-Matching. We observe that θ = 0.02 gives the best performance of CHEMNER_F. Table 4 shows the effect of different enlarged scopes on the performance of CHEMNER_FM. The enlarged scope determines which context is consulted when ontology-guided multi-type disambiguation at the sentence level produces ties. We observe that when the context types in one sentence still produce ties for multi-type disambiguation, it is more effective to go directly to the corpus level and look at the popularity scores for each type than to extend the ontology-guided multi-type disambiguation mechanism to the document level.
Qualitative Analysis. Table 5 shows some example sentences from our test set. We compare the prediction results of CHEMNER with two baseline methods: KB-Matching and RoBERTa. We also show the prediction results of our ablation models, CHEMNER_F and CHEMNER_FM, to demonstrate the contribution of each component and how the CHEMNER full model achieves the best performance step by step.
KB-Matching can only match entities that appear exactly in the KB dictionaries, which often leads to incomplete or missing annotations. Based on the results of KB-Matching, RoBERTa learns to give one context-specific label for each entity. For example, in Sentence # 1, KB-Matching fails to recognize "aryl chlorides" as a whole unit, yet it does match "aryl" to three types (i.e., "AROMATIC COMPOUNDS", "SUBSTITUENTS", and "FUNCTIONAL GROUPS"). RoBERTa learns the best label (i.e., "FUNCTIONAL GROUPS") for the multi-type entity (i.e., "aryl") based on the context. Although "FUNCTIONAL GROUPS" is indeed the best type for "aryl" if we look at the word individually, RoBERTa still achieves imperfect performance due to the incomplete boundaries inherited from KB-Matching.
Table 5: Example sentences from our test set with the prediction results of each method. Entity types are shown in brackets after each entity.
Sentence # 1
CHEMNER: ... two aryl chlorides [ORGANOHALIDES] can be coupled to one another without the isolation of the intermediate boronic acid [OXOACIDS] ...
Sentence # 2: The total synthesis of narciclasine [ALKALOIDS] is accomplished by the late-stage, amide-directed C-H hydroxylation [ORGANIC REDOX REACTIONS] ...
KB-Matching: The total synthesis of narciclasine [FREE RADICALS, ALKALOIDS, BIOMOLECULES] is accomplished by the late-stage, amide-directed C-H hydroxylation [ORGANIC REDOX REACTIONS] ...
RoBERTa: The total synthesis of narciclasine [BIOMOLECULES] is accomplished by the late-stage, amide-directed C-H hydroxylation [ORGANIC REDOX REACTIONS] ...
CHEMNER_F: The total synthesis of narciclasine [ALKALOIDS, BIOMOLECULES] is accomplished by the late-stage, amide-directed C-H hydroxylation [ORGANIC REDOX REACTIONS] ...
CHEMNER_FM: The total synthesis of narciclasine [ALKALOIDS] is accomplished by the late-stage, amide-directed C-H hydroxylation [ORGANIC REDOX REACTIONS] ...
CHEMNER: The total synthesis of narciclasine [ALKALOIDS] is accomplished by the late-stage, amide-directed C-H hydroxylation [ORGANIC REDOX REACTIONS] ...
With flexible KB-Matching, CHEMNER_F detects the complete boundaries and assigns much more suitable types in most cases. Based on the results of CHEMNER_F, using ontology-guided multi-type resolution, CHEMNER_FM determines the context-specific label that fits best. For example, in Sentence # 2, CHEMNER_F matches "narciclasine" to two types (i.e., "ALKALOIDS" and "BIOMOLECULES"). Here "ALKALOIDS" is a more suitable type and can be detected by CHEMNER_FM because "ALKALOIDS" and the context type "ORGANIC REDOX REACTIONS" are both under the ontology branch "ORGANIC CHEMISTRY". However, there are also a few cases in which the ontology-guided multi-type resolution is imperfect. For example, in Sentence # 1, CHEMNER_FM chooses the type "CHLORIDES" over "ORGANOHALIDES" for "aryl chlorides" because "CHLORIDES" and the context type "OXOACIDS" are both under the ontology branch "INORGANIC COMPOUNDS", whereas the ground truth label is just the opposite. This issue can be further resolved by the sequence labeling model trained on top of CHEMNER_FM. For example, in Sentence # 1, CHEMNER finally chooses "ORGANOHALIDES" over "CHLORIDES", probably because the sequence labeling model captures the co-occurrence pattern of "ORGANOHALIDES" and "OXOACIDS". Interestingly, from the perspective of chemistry, organohalides and organoboron species (a sector of oxoacids) are exactly the two coupling partners of the Suzuki coupling reaction.

Conclusions and Future Work
We propose CHEMNER, an ontology-guided, distantly-supervised method for fine-grained chemistry NER. It leverages the chemistry type ontology structure to generate distant labels with novel methods of flexible KB-matching and ontology-guided multi-type disambiguation. We also provide an expert-labeled chemistry NER dataset with 62 fine-grained chemistry types (e.g., chemical compounds and chemical reactions). Experimental results show that CHEMNER is highly effective, substantially outperforming the state-of-the-art NER methods on fine-grained chemistry NER. Although CHEMNER achieves strong performance, there is still considerable room for improvement. In the future, we plan to further refine and enrich the type ontology and incorporate more information in the dictionaries (e.g., chemical structures in the KBs) for better NER performance. We also plan to apply our fine-grained NER method to other scientific domains.

Ethics/Impact Statement
We provide an expert-labeled chemistry NER dataset with 62 fine-grained chemistry types on 1,600 sentences. The text corpus is collected from the open-source chemistry database PubChem. The entity types are collected from Wikipedia. We recruited 5 undergraduate annotators from the Chemistry Department at our university. Each annotator was compensated at an hourly rate of $15. The annotators were voluntary participants who were aware of any risks of harm associated with their participation and had given their informed consent. Our project was reviewed and approved by the IRB at our university. This dataset, which contains 1,600 carefully annotated sentences, can be used to benchmark named entity recognition performance on fine-grained chemistry NER. Each sentence is labeled with ground-truth entities, including both the entity boundaries and the entity types. We ask three domain experts to annotate each sentence. We provide the annotators with an auto-complete drop-down menu consisting of our entity type vocabulary. Each pair of annotators reaches substantial agreement, with a Fleiss's κ of 0.72. The conflicts among annotators are resolved by another senior domain expert in the final annotated test set. We have described many characteristics of the dataset in Section 3.1. More details of the dataset and the steps taken during the data collection and preparation process can be found in Appendix A.1.

A.1 Dataset Preparation
We have released all of our data and code for future studies, including the chemistry literature corpus, fine-grained entity type ontology and associated dictionaries collected from Wikipedia-Chemistry, manually-annotated test set for NER performance evaluation, and the code of CHEMNER.
Corpus Collection. We collected a corpus on Suzuki coupling reactions in the chemistry domain. Suzuki coupling is an important reaction for carbon-carbon bond formation in organic chemistry. Recent studies have focused on Suzuki coupling reactions to build AI-driven systems for molecular discovery, synthetic strategy design, and manufacturing. This corpus contains 4,608 papers retrieved from PubChem with the query "Suzuki Coupling"; all of them have the title and abstract, and 319 of them also have the full text. There are 71,406 sentences in total in this corpus.
Dictionary Collection. We collected a fine-grained chemistry entity type ontology from Wikipedia by treating category pages as types and the titles of the pages associated with each category as the entities for each type. We first conducted a depth-first search (DFS) starting from the Chemistry category and found that the search had not stopped after one million categories had been visited, and that a category relevant to chemistry often has irrelevant children. Therefore, we decided to use a technical term list to filter out irrelevant categories. We collected a spell-checker dictionary (Azman, 2012) with over 104,000 technical chemistry terms, and dropped a category from the search if less than 20% of the 1-grams in its name and the names of all its direct children were covered by the dictionary. The threshold of 20% was selected empirically. After this step, we obtained a fine-grained chemistry entity type ontology with 3,775 types and 101,415 entities. We further tailored the entity type ontology and its associated entities by removing some irrelevant types and merging some fine-grained types into their coarse-grained parent types based on their frequencies in our chemistry literature corpus. We also expanded the entity dictionaries with synonyms collected from the PubChem knowledge base. Finally, we obtained a fine-grained chemistry entity type ontology with 62 types and 10,551 entities. Figure 4 shows our complete chemistry entity type ontology.
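The term-list filter used during the category search can be sketched as follows. The function name, the whitespace tokenization into 1-grams, and the example category names are assumptions for illustration; only the 20% coverage threshold comes from the description above.

```python
def keep_category(name, child_names, term_list, threshold=0.2):
    """Keep a category if at least `threshold` of the 1-grams in its name and
    the names of its direct children are covered by the technical term list."""
    unigrams = [w.lower() for n in [name, *child_names] for w in n.split()]
    if not unigrams:
        return False
    covered = sum(w in term_list for w in unigrams)
    return covered / len(unigrams) >= threshold
```

A category that fails this check is dropped from the search, so its descendants are never visited, which is what keeps the traversal from drifting into unrelated parts of the category graph.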
Test Set Annotation. We randomly select 1,600 sentences from the corpus as our test set and ask three domain experts to annotate each sentence. We leave the remaining 69,806 sentences in the corpus as the training set for distant supervision. We provide the annotators with an auto-complete drop-down menu consisting of our entity type vocabulary. Each pair of annotators reaches substantial agreement, with a Fleiss's κ of 0.72. The conflicts among annotators are resolved by another senior domain expert in the final annotated test set.
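For reference, the agreement statistic can be computed as in the sketch below, which implements the standard Fleiss' kappa formula. How the annotation units and categories were defined for the reported value of 0.72 is not specified here, so the input layout (one row per annotated item, one column per category, each cell counting annotators) is an assumption.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of annotators who assigned item i to category j.
    Every item must be rated by the same number of annotators."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that fall into each category.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement on each item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)
```

With three annotators and two categories, unanimous ratings on every item give a kappa of 1, and the value drops as items with split decisions are added.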

A.2 Parameter Settings
Runtime with Parameters. We compared all the sequence models we adopted in the experiments. Our models are trained on a single NVIDIA Titan Xp (12GB) GPU. The details about the average runtime and the number of parameters are given in Table 6. All training hyperparameters follow their original implementation.
BiLSTM-CRF. We used the code base of BiLSTM-CRF (https://github.com/Gxzzz/BiLSTM-CRF). The hyperparameters are set to default values. We trained the BiLSTM-CRF model on the Suzuki Coupling data for 10 epochs with a learning rate of 0.001, a hidden dimension of 256, a dropout rate of 0.5, and word embeddings of dimension 256.
AutoNER. We adopted the code base from AutoNER's original implementation (https://github.com/shangjingbo1226/AutoNER). The hyperparameters are set to default values. We trained the AutoNER model on the Suzuki Coupling data for 50 epochs with a learning rate of 0.05, a hidden dimension of 300, a dropout rate of 0.5, and pretrained word embeddings of dimension 200.
RoBERTa. We use the HuggingFace Transformers Python interface (https://github.com/huggingface/transformers) to train the RoBERTa model on the Suzuki Coupling data using the roberta-base model with 10 epochs and a batch size of 32. The other hyperparameters are set to default values.
ChemBERTa. For ChemBERTa, we also use HuggingFace Transformers to train the model on the Suzuki Coupling data using the seyonec/ChemBERTa-zinc-base-v1 model with 10 epochs and a batch size of 32. The other hyperparameters are set to default values.
BOND. To train on our Suzuki Coupling data using BOND, we use their publicly available code, which also uses the HuggingFace Transformers roberta-base model as the base model for training. We train the model for 20 epochs with a learning rate of 2e-5. The other hyperparameters are set to default values.