ChEMU-Ref: A Corpus for Modeling Anaphora Resolution in the Chemical Domain

Chemical patents contain rich coreference and bridging links, which are the target of this research. Specifically, we introduce a novel annotation scheme, based on which we create the ChEMU-Ref dataset from reaction description snippets in English-language chemical patents. We propose a neural approach to anaphora resolution, which we show to achieve strong results, especially when jointly trained over coreference and bridging links.


Introduction
Chemical research has contributed greatly to human society and wellbeing, including new medicines and vaccines (Gwynne and Heabrer, 2015). Research is heavily reliant on knowledge of existing chemical processes and methods of chemical synthesis, which are documented in chemical research literature and chemical patents. Given the rapid growth of both publications and patents in chemistry, the need for automatic methods to extract semi-structured knowledge from chemical texts is becoming increasingly critical (Li et al., 2016; Akhondi et al., 2019).
Anaphora resolution is a key component of comprehensive information extraction (Rösiger, 2019; Poesio et al., 2016). In chemistry, different chemical compounds are mixed and reacted together in different ways to generate novel compounds, and understanding the precise chemical process often involves both resolving anaphoric references and understanding the chemical changes/interactions a given entity is involved in. For example, as seen in Figure 1, while the final mention of mixture on line 3 and that on line 4 are both coreferent and chemically identical, in the case of mixture on line 2 and the first mention of mixture on line 3, the chemical composition is the same but a transformation has taken place via the stir and cool actions.
Our aim in this paper is to both identify anaphoric references in chemical patents, and determine the chemical relation between each linked pair of entities. We propose a domain-specific annotation framework based on five types of anaphora relations combining coreference and bridging. We then construct a dataset following this framework, annotated by chemical experts who achieve high inter-annotator agreement. We additionally extend existing anaphora resolution methods to model anaphora in chemical text, and compare both component-wise and joint models for anaphora resolution. This dataset will be released as part of the upcoming ChEMU 2021 shared task (He et al., to appear).
Our contributions in this paper are as follows: (1) we propose a novel annotation scheme for anaphora resolution in chemical patents; (2) we develop a novel anaphora-resolution dataset based on chemical patents; and (3) we extend a general-purpose coreference resolution method, and achieve strong results via joint training over coreference and bridging with domain-specific fine-tuning.

Related Work
Anaphora occurs in two basic forms: coreference and bridging. Coreference occurs when different expressions in a text refer to the same entity in the real world (Ng, 2017; Clark and Manning, 2015), while bridging occurs between discrete entities that are linked via lexical-semantic, frame-based, or encyclopedic relations (Asher and Lascarides, 1998; Hou et al., 2018).
The CoNLL-2012 dataset (Pradhan et al., 2012) is a general corpus consisting of texts from three languages (English, Chinese, and Arabic). It is annotated based on OntoNotes v5.0 (Weischedel et al., 2013) and includes two types of coreference relations: IDENTITY, a symmetrical and transitive relation; and APPOSITIVE, two noun phrases that are adjacent and not linked by a copula. Coreference resolution is modelled as a clustering task. The WikiCoref corpus was constructed with the same relations, over Wikipedia documents (Ghaddar and Langlais, 2016a).
BioNLP-ST 2011 (Nguyen et al., 2011) is a domain-specific coreference corpus over abstracts from biomedical literature, focusing mainly on gene-protein coreference, considering four relations: RELAT (relative pronouns or adjectives, e.g. which), PRON (pronouns, e.g. they), DNP (definite or demonstrative noun phrases marked with the, this, etc.), and APPOS (apposition). Instead of modelling coreference resolution as a clustering task, here the direction of coreference links is preserved. As this corpus focuses on gene-protein coreference, the range of coreference phenomena is limited. The CRAFT-CR corpus (Cohen et al., 2017) adds coreference relations to the Colorado Richly Annotated Full Text (CRAFT) corpus (Bada et al., 2012), following OntoNotes v5.0 with minor adaptations, and including discontinuous expressions, domain-specific proper nouns, and a broad range of mention types.
The definition of bridging is somewhat imprecise (Zeldes, 2017; Hou et al., 2018), and different corpora have adopted different definitions. Broadly, two types of bridging are distinguished: referential bridging, which can be treated as a context-based relation; and lexical bridging, which describes lexical-semantic relations such as holonymy and meronymy. Poesio et al. (2008) introduced the ARRAU corpus of general language texts for bridging, which consists of news, dialogue, and narrative text. In the corpus, entities are limited to noun phrases, and most bridging pairs are lexical relations, with only a small number of instances of referential bridging. ISNotes (Hou et al., 2018) includes 50 Wall Street Journal (WSJ) articles from the OntoNotes corpus, and has both coreference and bridging annotations, with most of the bridging pairs being referential. BASHI (Rösiger, 2018a) has both coreference and bridging annotations over 50 WSJ articles based on the OntoNotes v5.0 guidelines, with most bridging links once again being referential. Rösiger (2016) developed a corpus called SciCorp based on English scientific papers, following the same annotation scheme as BASHI.
Due to limited dataset availability, most research has modelled coreference resolution and bridging separately. There are two basic approaches to coreference resolution. The first is mention-ranking methods, which aim to score the coreferent probability of mention pairs (Clark and Manning, 2015, 2016a,b; Wiseman et al., 2015, 2016), and make the assumption that mentions have been pre-identified, meaning they are heavily reliant on upstream mention detection methods. The second is span-ranking methods, which combine mention detection with coreference prediction (Lee et al., 2017; Zhang et al., 2018; Grobol, 2019; Kantor and Globerson, 2019), and tend to perform better. Bridging methods can be grouped into: (1) rule-based methods (Hou et al., 2014; Rösiger, 2018b); and (2) machine learning methods (Hou, 2018a,b, 2020; Yu and Poesio, 2020). Rule-based methods have been shown to achieve competitive results on domain-specific corpora, but equally to be domain brittle. Yu and Poesio (2020) jointly trained a model for coreference resolution and bridging by adapting a span-ranking method for coreference (Kantor and Globerson, 2019), achieving good performance over various bridging corpora. However, they evaluated their model only on bridging.

Annotation Scheme
In this section, we introduce our annotation guidelines for anaphora resolution in chemical patents. The complete annotation guidelines are available in Fang et al. (2021).

Corpus Selection
We build on the ChEMU corpus (Verspoor et al., 2020) developed for the ChEMU 2020 shared task (He et al., 2020). This corpus consists of 'snippets' extracted from chemical patents, where each snippet corresponds to a reaction description. It is common that several snippets are extracted from the same chemical patent.

Mention Type
We aim to capture anaphora in chemical patents, with a focus on identifying chemical compounds during the reaction process. Consistent with other anaphora corpora (Pradhan et al., 2012;Cohen et al., 2017;Ghaddar and Langlais, 2016b), only mentions that are involved in referring relationships (as defined in Section 3.3) and related to chemical compounds are annotated. The mention types that are considered for anaphora annotation are listed below.
It should be noted that verbs (e.g. mix, purify, distil) and descriptions that refer to events (e.g. the same process, step 5) are not annotated in this corpus.
Chemical Names: Chemical names are a critical component of chemical patents. We capture as atomic mentions the formal name of chemical compounds, e.g. N-[4-(benzoxazol-2-yl)methoxyphenyl]-S-methyl-N'-phenyl-isothiourea or 2-Chloro-4-hydroxy-phenylboronic acid. Chemical names often include nested chemical components, but for the purposes of our corpus, we consider chemical names to be atomic and do not annotate internal mentions. Hence 4-(benzoxazol-2-yl)methoxyphenyl and acid in the examples above will not be annotated as mentions, as they are part of larger chemical names.
Identifiers: In chemical patents, identifiers or labels may also be used to represent chemical compounds, in the form of uniquely-identifying sequences of numbers and letters such as 5i. These can be abbreviations of longer expressions incorporating that identifier that occur earlier in the text, such as chemical compound 5i, or may refer back to an exact chemical name with that identifier. Thus, the identifier is annotated as an atomic mention as well.
Phrases and Noun Types: Apart from chemical names and identifiers, chemical compounds are commonly presented as noun phrases (NPs). An NP consists of a noun or pronoun plus premodifiers; NPs are the most common type of compound expression in chemical patents. Here we detail the NP types that are related to compounds:
• Pronouns: In chemical patents, pronouns (e.g. they or it) usually refer to a previously mentioned chemical compound.
• Definite NPs: Commonly used to refer to chemical compounds, e.g. the solvent, the title compound, the mixture.
Furthermore, there are a few types of NPs that need specific handling in chemical patents:
• Quantified NPs: Chemical compounds are usually described with a quantity. NPs with quantities are considered as atomic mentions if the quantities are provided, e.g. 398.4 mg of the compound 1.
• NPs with prepositions: Chemical NPs connected with prepositions (e.g. in, with, of) should be considered as a single mention. For example, the appropriate amino derivative in dry THF is a single mention.
NPs describing chemical equipment containing a compound may also be relevant to anaphora resolution. This generally occurs when the equipment that contains the compound undergoes a process that also affects the compound. Thus, expressions such as the flask and the autoclave can also be mentions if they are used to implicitly refer to a contained compound.
Relationship to ChEMU 2020 entities: Since this dataset is built on the ChEMU 2020 corpus (He et al., 2020), annotation of related chemical compounds is available by leveraging existing entity annotations introduced for the ChEMU 2020 named entity recognition (NER) task. However, there are some differences in the definitions of entities for the two tasks.
In the original ChEMU 2020 corpus, entity annotations identify chemical compounds (i.e. REACTION PRODUCT, STARTING MATERIAL, REAGENT CATALYST, SOLVENT, and OTHER COMPOUND), reaction conditions (i.e. TIME, TEMPERATURE), quantity information (i.e. YIELD PERCENT, YIELD OTHER), and example labels (i.e. EXAMPLE LABEL). There is overlap with our definition of mention for the labels relating to chemical compounds. However, in our annotation, chemical names are annotated along with additional quantity information, as we consider this information to be an integral part of the chemical compound description. Furthermore, the original entity annotations do not include generic expressions that co-refer with chemical compounds such as the mixture, the organic layer, or the filtrate, and neither do they include equipment descriptions.

Relation Types
Anaphora resolution subsumes both coreference and bridging. In the context of chemical patents, we define four sub-types of bridging, incorporating generic and chemical knowledge.
A referring mention which cannot be interpreted on its own, or an indirect mention, is called an anaphor, and the mention which it refers back to is called the antecedent. In relation annotation, we preserve the direction of the anaphoric relation, from the anaphor to the antecedent. Following similar assumptions in recent work, we restrict annotations to cases where the antecedent appears earlier in the text than the anaphor.

Coreference
Coreference is defined as holding between expressions/mentions that refer to the same entity (Ng, 2017; Clark and Manning, 2015). In chemistry, identifying whether two mentions refer to the same entity requires consideration of various chemical properties (e.g. temperature or pH). As such, for two mentions to be coreferent, they must share the same chemical properties. We consider two different cases of coreference:
• Single antecedent: the anaphor refers to a single antecedent.
• Multiple antecedents: the anaphor refers to multiple antecedents, e.g. in cases where multiple antecedents are combined to form a single mixture.
It is possible for there to be ambiguity as to which mention of a given antecedent an anaphor refers to (where the mention is repeated); in these cases the closest mention is selected.

Bridging
As stated in Section 3.3.1, when considering anaphora relations, we take the chemical properties of the mentions into consideration. Coreference is insufficient to cover all instances of anaphora in chemical patents, and bridging occurs frequently. We define four bridging types:
TRANSFORMED: Links between chemical compounds that are initially based on the same components, but which have undergone a change in condition, such as pH or temperature. Such cases must be one-to-one relations (not one-to-many). As shown in Figure 1, the mixture in line 2 and the first-mentioned mixture in line 3 have the TRANSFORMED relation, as they have the same chemical components but different chemical properties.

REACTION-ASSOCIATED: The relation between a chemical compound and its immediate source compounds, which were combined via a mixing process in which the source compounds retain their original chemical structure. This relation is one-to-many, from the anaphor to the source compounds (antecedents). For example, the mixture in line 2 has REACTION-ASSOCIATED links to three mentions on line 1 that are combined to form it: (1) the solution of Compound (4) (0.815 g, 1.30 mmol) in THF (4.9 ml); (2) acetic acid (9.8 ml); and (3) water (4.9 ml).

WORK-UP: The relation between a chemical compound and the compounds used in the work-up process that produced it. It is a one-to-many relation, from the anaphor to the compounds (antecedents) that are used for the work-up process. As demonstrated in Figure 1, The combined organic layer in line 5 comes from the extraction of The mixture and ethyl acetate in line 4, and they are hence annotated as WORK-UP.
CONTAINED: A chemical compound is contained inside some equipment. It is a one-to-many relation from the anaphor (equipment) to the compounds (antecedents) that it contains. An example of this is a flask and the solution of Compound (4) (0.815 g, 1.30 mmol) in THF (4.9 ml) on line 1, where the compound is contained in the flask.

Task definition
Anaphora resolution can be decomposed into a two-step task: (1) mention detection; and (2) anaphora relation detection. For the evaluation of mention and relation detection, we use precision, recall, and F1. One issue here is that, for coreference resolution, anaphors can link to multiple antecedents. Many coreference evaluation metrics (Moosavi and Strube, 2016; Recasens and Hovy, 2011; Luo, 2005) cannot deal with this, since they model coreference resolution as a clustering task, where all related antecedents and anaphors occur in one cluster, and assume a given mention occurs in a unique cluster. Hence we adopt the evaluation approach of Kim et al. (2012), scoring coreference from two perspectives: (1) surface coreference; and (2) atom coreference. Surface coreference considers whether the anaphor refers to the closest previous antecedent(s). Atom coreference considers whether the anaphor refers to the correct antecedent(s): atom coreference links take coreferent transitivity into consideration and can be generated from surface coreference links, which we use by default.
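To make the two perspectives concrete, the following sketch (our own illustrative code, not the evaluation script of Kim et al. (2012); all names are hypothetical) derives atom coreference links from surface links by taking the transitive closure of antecedent chains:

```python
def atom_links(surface):
    """Expand surface coreference links into atom links.

    surface: dict mapping each anaphor to the set of its closest
             antecedent mention(s), i.e. the surface links.
    Returns a dict mapping each anaphor to every antecedent reachable
    by following links transitively (the atom links).
    """
    atom = {}
    for anaphor, antecedents in surface.items():
        reachable, stack = set(), list(antecedents)
        while stack:
            m = stack.pop()
            if m in reachable:
                continue
            reachable.add(m)
            # follow the chain: an antecedent may itself be an anaphor
            stack.extend(surface.get(m, ()))
        atom[anaphor] = reachable
    return atom

# e.g. "the mixture" (m3) -> "the mixture" (m2) -> original solution (m1)
surface = {"m3": {"m2"}, "m2": {"m1"}}
print(atom_links(surface)["m3"])  # m3 atomically corefers with both m2 and m1
```

Under this view, a system is scored on surface links if it recovers the closest antecedent, and on atom links if it recovers any mention in the transitive chain.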
For the corpus annotation, we use the BRAT text annotation tool. To date, 220 snippets have been annotated by two chemical experts, a PhD candidate and a final-year bachelor student in Chemistry. Four rounds of annotation training were completed prior to beginning official annotation. In each round, the two annotators individually annotated the same 10 snippets (different across each round of annotation), and compared their annotations; the annotation guidelines were then refined based on discussion. After several rounds of training, we achieved a high inter-annotator agreement of Krippendorff's α = 0.92 (Krippendorff, 2004) at the mention level, and α = 0.84 for relations. In total, 1,500 snippets will be annotated in the final dataset that will be used in the ChEMU 2021 shared task.
The statistics of the current corpus, and train/dev/test set splits that form the basis of our experiments in this paper, are shown in Table 1. The dev and test partitions were both double annotated by the two expert annotators, with any disagreements merged by an adjudicator.

Methodology
We propose a joint neural model for anaphora resolution. Similar to Yu and Poesio (2020), our model adapts an end-to-end neural coreference resolution model (Lee et al., 2017), as outlined in Figure 2. Assume the snippet has T tokens, represented as vectors X = {x_1, ..., x_T} consisting of fixed pretrained word embeddings and character embeddings learned from a convolutional neural network (CNN).
For mention candidate detection, we follow Lee et al. (2017), considering each sequence of contiguous tokens as a potential span, and computing a span score s_m(i) for each possible span. Specifically, the span representation s_i is obtained by concatenating the output token representations (x*) of a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) at the span boundaries, the syntactic head representation (h_i) obtained from an attention mechanism (Bahdanau et al., 2015), and a feature vector of the mention (φ(i)):

s_i = [x*_START(i); x*_END(i); h_i; φ(i)]

and the span score s_m(i) is computed as:

s_m(i) = FFNN_m(s_i)

where FFNN denotes a feed-forward neural network, and START(i) and END(i) represent the starting and ending token indices for span i, respectively.
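As an illustration, the span representation and scoring step can be sketched as follows (our own minimal NumPy sketch, not the authors' implementation; the toy dimensions and the single-hidden-layer FFNN are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def span_representation(x, head, feat, start, end):
    # s_i = [x*_START(i); x*_END(i); h_i; phi(i)]: concatenate the BiLSTM
    # outputs at the span boundaries, the attention-based head vector,
    # and the span feature vector.
    return np.concatenate([x[start], x[end], head, feat])

def span_score(s_i, W1, b1, w2, b2):
    # s_m(i) = FFNN_m(s_i): one ReLU hidden layer, scalar output.
    h = np.maximum(0.0, s_i @ W1 + b1)
    return float(h @ w2 + b2)

# toy setup: 5 tokens with 8-dim BiLSTM outputs, 8-dim head, 4-dim features
x = rng.normal(size=(5, 8))
head, feat = rng.normal(size=8), rng.normal(size=4)
s_i = span_representation(x, head, feat, start=1, end=3)  # 8+8+8+4 = 28 dims
W1, b1 = rng.normal(size=(28, 16)), np.zeros(16)
w2, b2 = rng.normal(size=16), 0.0
print(span_score(s_i, W1, b1, w2, b2))  # a scalar mention score
```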
To reduce the number of spans considered, we retain a beam of λT candidate mention spans. Inspired by Zhang et al. (2018), the mention loss is defined as a binary cross-entropy over the candidate spans:

L_mention = − Σ_i [ 1{i ∈ GOLD_m} log σ(s_m(i)) + 1{i ∉ GOLD_m} log(1 − σ(s_m(i))) ]

where GOLD_m is the set of gold mentions that are involved in anaphora relations, and σ is the sigmoid function.

For anaphoric relation detection, a span pair embedding is obtained by concatenating the two span embeddings (s_i, s_j), their element-wise product (s_i ∘ s_j), and a feature vector (φ(i, j)) for span pair i and j:

s(i, j) = [s_i; s_j; s_i ∘ s_j; φ(i, j)]

As coreference and bridging are different, we consider them separately.
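The beam pruning and the mention loss can be sketched as follows (a hand-rolled illustration; the sigmoid cross-entropy form follows Zhang et al. (2018), while the function names and score values are our own):

```python
import math

def prune_spans(scores, lam, T):
    # keep the lambda*T highest-scoring candidate spans (the mention beam)
    k = max(1, int(lam * T))
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sorted(top)

def mention_loss(scores, gold):
    # binary cross-entropy: push sigmoid(s_m(i)) towards 1 for gold
    # mentions (i in GOLD_m) and towards 0 otherwise
    loss = 0.0
    for i, s in enumerate(scores):
        p = 1.0 / (1.0 + math.exp(-s))
        loss -= math.log(p) if i in gold else math.log(1.0 - p)
    return loss

scores = [2.0, -1.5, 0.3, -3.0]           # s_m(i) for four candidate spans
print(prune_spans(scores, lam=0.5, T=4))  # -> [0, 2]
print(round(mention_loss(scores, gold={0, 2}), 3))
```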
For coreference resolution, we follow Lee et al. (2017) in optimizing the marginal log-likelihood of all correct antecedents for a given anaphor:

L_coref = − Σ_{i=1..N} log Σ_{ŷ ∈ Y(i) ∩ GOLD_c(i)} P(ŷ)

where N is the number of candidate mentions, and Y(i) = {ε, 1, ..., i − 1} is the set of possible assignments for each y_i, in which ε represents a dummy antecedent and the numbers represent the preceding spans. GOLD_c(i) is the set of gold coreferent antecedents that span i refers to; if span i does not have a coreferent antecedent, GOLD_c(i) = {ε}. P(y_i) is obtained via a softmax over the antecedent scores s_c for the corresponding anaphor:

P(y_i) = exp(s_c(i, y_i)) / Σ_{y' ∈ Y(i)} exp(s_c(i, y'))

For bridging resolution, as we have four relations, we model it as a multi-class classification task for each span pair. We represent the bridging relation as a one-hot vector, and introduce a new relation type NO-RELATION for span pairs that do not have a bridging relation. The loss for bridging is the cross-entropy:

L_bridging = − Σ_{i,j} Σ_{c=1..K_c} GOLD_b(c) log y_b(i, j, c)

where K_c is the number of bridging categories, y_b(i, j, c) denotes the prediction for span pair (i, j) under category c, and GOLD_b(c) is the gold bridging label under category c.

Table 2: Anaphora resolution results over the test dataset (%). Models are trained for "coreference", "bridging", or "joint train" (both tasks jointly). Models were trained over 10,000 epochs, and results are averaged over 3 runs with different random seeds. "F_A" and "F_R" denote the F1 score for anaphor and relation prediction, respectively.
The total loss is L = L_mention + L_ref, where L_ref = L_coref for coreference, L_bridging for bridging, and L_coref + L_bridging for joint training.
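Both relation losses reduce to simple softmax-based objectives. A minimal illustrative sketch (our own, with made-up scores; not the actual implementation):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def coref_loss(antecedent_scores, gold):
    # antecedent_scores: s_c for [epsilon, span_1, ..., span_{i-1}];
    # gold: indices of the gold antecedents (index 0 = dummy epsilon).
    # Marginal log-likelihood: sum P(y) over all correct antecedents.
    p = softmax(antecedent_scores)
    return -math.log(sum(p[j] for j in gold))

def bridging_loss(category_scores, gold_class):
    # multi-class cross-entropy over K_c categories (incl. NO-RELATION)
    p = softmax(category_scores)
    return -math.log(p[gold_class])

# an anaphor with 3 candidate antecedents; spans 1 and 3 are both gold
print(round(coref_loss([0.0, 2.1, -0.5, 1.7], gold={1, 3}), 3))
# a span pair scored over 5 categories; gold category index 0
print(round(bridging_loss([3.2, 0.1, -1.0, 0.4, 0.0], gold_class=0), 3))
```

Marginalizing over all gold antecedents (rather than picking one) is what lets the coreference objective handle anaphors with multiple valid antecedents.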

Experiments
In this section, we detail our experiments. We use hyperparameters similar to those of prior work. Specifically, we use GloVe embeddings (Pennington et al., 2014) with window size 2 for head word embeddings. For the BiLSTM, GloVe embeddings with window size 10 and contextualized ELMo word representations (Peters et al., 2018) are used. Character embeddings are learned from a character CNN with windows of 3, 4, and 5 characters, each with 50 filters. For bridging prediction, the feed-forward neural networks are composed of two hidden layers with 150 dimensions and rectified linear units (Nair and Hinton, 2010).
We separate the gold mentions into those for coreference and bridging. For joint training, the gold mentions are combined. Table 2 presents the results. For coreference evaluation, given that the results in Table 2 indicate that the surface and atom coreference results are not substantially different, we use surface coreference as our primary evaluation metric in the remainder of this paper. For bridging evaluation, we consider the overall bridging result as our primary analysis.
Overall, the joint training configuration achieves a 54.2% F1 score for coreference resolution and a 71.1% F1 score for bridging, representing +0.8% and +1.2% absolute F1 improvements over the component-wise models. This indicates that joint training improves the performance of both tasks. Compared to bridging, the performance of anaphor detection in coreference resolution is lower, particularly in terms of recall, possibly because the data is sparser.

Table 3: Comparison with providing oracle mentions during training; results on the test dataset, using surface scoring for coreference. "F_A" = F1 for anaphor prediction; "F_R" = F1 for relation prediction.
To investigate the contribution of each step (mention detection vs. relation detection), we experiment with providing oracle mentions during the training process. Table 3 shows that the performance of both tasks improves substantially with gold mentions. We achieve an 82.1% F1 score for relation prediction under joint training, a +13.5% absolute F1 improvement. That is, further improvement in mention detection will improve resolution results.

Table 4: Comparison of different pretrained embeddings; results over the test dataset, using surface scoring for coreference. "F_A" = F1 for anaphor prediction; "F_R" = F1 for relation prediction.
To determine the importance of domain fine-tuning, we also experiment with an ELMo model pretrained on a 1-billion-word chemical patent corpus (Zhai et al., 2019), referred to as CHELMO. The experimental results are provided in Table 4. With CHELMO, the performance of anaphor detection and relation detection improves by +3.3% and +4.5% absolute F1 score, respectively. We also plot model performance with increasing amounts of training data in Figure 3. While the model performance is starting to plateau, further gains could be attained with more annotated data. The strong correlation between anaphor detection and relation detection is also evident in the graph.
To perform error analysis, we analysed the model errors on the dev dataset. As detailed in Table 1, the corpus contains discontinuous mentions. However, our proposed model only considers continuous spans, accounting for some of the low recall.
For coreference resolution, errors can be attributed to three primary phenomena: 1. Long-distance relations: as illustrated in Table 5 Ex 1, the title compound (360 mg, 1.05 mmol, 32%) refers to a compound at the beginning of the snippet; the model generally fails to capture such long-distance relations.
2. Multiple antecedents: as discussed in Section 3.3.1, an anaphor may have multiple antecedents; however, the model predicts a single antecedent for each anaphor.
3. Imbalance of coreference and bridging relations: bridging relations are more prevalent than coreference relations in the corpus.

For bridging, as shown in Table 2, the performance suffers from low recall in anaphor detection. Furthermore, the confusion matrix of fine-grained bridging relations in Figure 4 shows that the model achieves poor performance for REACTION-ASSOCIATED and WORK-UP relation prediction, both in terms of precision and recall.
We further investigated the over-prediction problem in bridging. As shown in Table 5 Ex 2, the reaction mixture in line 3 has a REACTION-ASSOCIATED link with The reaction mixture in line 2 and sodium borohydride (10 mg, 0.27 mmol). The model over-predicts additional links to the two further compounds that are linked to the previous mention of The reaction mixture in line 2. The WORK-UP relation in Ex 5 is similar: the second-mentioned the organic layer links to the first-mentioned the organic layer and magnesium sulfate. The filtered material, chloroform, and water should be linked with the first-mentioned the organic layer, but are instead linked to the second. Such errors result from individual span-pair predictions, which make it hard to capture interactions between anaphors. Evaluating the antecedents simultaneously may address this. There is also room for improvement in our model's ability to model context: in Table 5 Ex 3, due to the expression add into, the first-mentioned the reaction mixture does not include the chemicals mentioned prior, unlike the first mention of the phrase in Ex 2.

Table 5: Example snippets referenced in the error analysis.
1. Step D: Ethyl 7-chloro-6-(difluoromethyl)-2-(trifluoromethyl)pyrazolo[1,5-a]pyridine-3-carboxylate. ... Purification (FCC, SiO2, eluting with n-hexane:dichloromethane (2:1)) afforded the title compound (360 mg, 1.05 mmol, 32%).
2. The reaction mixture was then heated at 50 °C for 2.5 h. An additional portion of sodium borohydride (10 mg, 0.27 mmol) was added, and the reaction mixture was heated at 50 °C for an additional 2 h...
3. ... after 55.8 mg of 6-chloro-7-deazapurine and 191 mg of potassium carbonate were sequentially added into the reaction mixture, the reaction mixture was refluxed for about 36 hours and then cooled down to room temperature...
4. ... In the same manner as in Synthesis Example 8 except for using 2.11 g of the intermediate 6 in place of the intermediate 21 and using 1.00 g of 4-bromobiphenyl in place of bromobenzene, 1.49 g (yield: 56%) of a white solid was obtained.
5. ... The filtered material was extracted with chloroform and water, and then the organic layer was dried by using magnesium sulfate. Thereafter, the organic layer was distilled under reduced pressure...
6. ... acetonitrile (150 mL) was added under ultrasonic conditions to give a large amount of precipitate. After suction filtration, the filter cake was washed with acetonitrile (20 mL×3) and dried in vacuum to obtain the title compound (1.52 g, 86.9%).
There are several causes of false negatives: 1. Reaction description variation: Chemical reactions are usually described step by step, and our model performs well on this structure. However, sometimes only part of a reaction is described: Table 5 Ex 4 illustrates chemical compounds that are listed without an accompanying process.
2. Abstract expressions: In Table 5 Ex 6, precipitate should have a WORK-UP relation with acetonitrile (150 mL), and the title compound ... with the filter cake; these are missed due to inadequate modelling of domain terminology.

Conclusion
We propose a novel annotation scheme for anaphora resolution in chemical patents. For our annotation, we incorporate generic and domain-specific knowledge to define coreference and bridging relations specific to the chemical domain, based on which we created the novel ChEMU-Ref dataset.
Our corpus analysis and inter-annotator agreement show the complexity of the task, as well as the high quality of the annotation. We model anaphora resolution as two sub-tasks, mention detection and anaphora relation detection, and also propose a joint training model, which outperforms the separately-trained models. By incorporating embeddings pretrained on the chemical domain, we found that domain knowledge boosts performance.
With detailed error analysis, we also identified directions to further enhance performance.

A Additional Experimental Results
In the following tables, we provide the detailed experimental results described in the main paper. Table 6 provides a full comparison of training with gold-standard oracle mentions per anaphora relation on the test dataset.

Table 6: Test results with gold-standard mentions during training. Models are trained for "coreference", "bridging", or "joint train" (both tasks jointly). Models are trained over 10,000 epochs; results are averaged over 3 runs with different random seeds. "F_A" and "F_R" denote the F1 score for anaphor and relation prediction, respectively.

Table 7: Results with different pretrained embeddings. "coreference", "bridging", and "joint training" represent models trained on the coreference resolution task, the bridging task, and both tasks jointly, respectively. We train the models over 10,000 epochs, and average over 3 runs with different random seeds. "F_A" and "F_R" denote the F1 score for anaphor and relation prediction, respectively.