ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision

Structured chemical reaction information plays a vital role for chemists engaged in laboratory work and advanced endeavors such as computer-aided drug design. Despite the importance of extracting structured reactions from scientific literature, data annotation for this purpose is cost-prohibitive due to the significant labor required from domain experts. Consequently, the scarcity of sufficient training data poses an obstacle to the progress of related models in this domain. In this paper, we propose ReactIE, which combines two weakly supervised approaches for pre-training. Our method utilizes frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions. Additionally, we adopt synthetic data from patent records as distant supervision to incorporate domain knowledge into the model. Experiments demonstrate that ReactIE achieves substantial improvements and outperforms all existing baselines.


Introduction
The integration of advanced Natural Language Processing (NLP) techniques in the field of chemistry has been gaining significant attention in both academia and industry (Wang et al., 2019; Fabian et al., 2020; Chithrananda et al., 2020). By formulating applications in chemistry as molecular representation (Shin et al., 2019; Wang et al., 2022a), information extraction (Vaucher et al., 2020; Wang et al., 2021, 2022b), and text generation (Edwards et al., 2022) tasks, NLP approaches provide new avenues for effective understanding and analysis of chemical information. In particular, we focus on the chemical reaction extraction task, as it can serve as a valuable reference for chemists to conduct bench experiments (Guo et al., 2022).
Despite the abundance of text describing chemical reactions in the scientific literature, converting it into a structured format remains a major challenge. One approach is the utilization of domain experts to manually extract chemical reactions, resulting in several commercial reaction databases, such as Reaxys (Goodman, 2009) and SciFinder (Gabrielson, 2018). However, this method is associated with significant time and labor costs, as well as the issue of restricted access to these resources. Subsequently, research efforts concentrated on automated systems, including OPSIN (Lowe, 2012) and CHEMRXNBERT (Guo et al., 2022). OPSIN is a heuristic-based system that employs a complex set of rules to identify the reaction roles. While it is effective for well-formatted text, OPSIN's performance is limited in scientific literature due to its sensitivity to variations in language use. In contrast, Guo et al. (2022) obtained CHEMRXNBERT by pre-training with language modeling on chemistry journals; however, the model's performance is constrained by the small size of the training set during fine-tuning. This raises the question of how to effectively utilize large-scale unlabeled data for this task, which remains an under-explored area.

In this paper, we present REACTIE, a pre-trained model for chemical reaction extraction. In light of the clear gap between prevalent pre-training tasks and the applications in the field of chemistry, we propose two weakly supervised methods to construct synthetic data for pre-training. Intuitively, humans can infer certain roles in chemical reactions from linguistic cues. As shown in Figure 1, we can identify "5e" as the product from the semantic meaning of the phrase "to obtain 5e". To this end, we mine frequent patterns from texts as linguistic cues and inject them into the model. Furthermore, domain knowledge also plays a crucial role in this task. For example, the accurate identification of "chloranil" as a catalyst rather than a reactant in Figure 1 requires a deep understanding of the related compounds. To address this, we incorporate domain knowledge into REACTIE by utilizing patent literature as distant supervision. By pre-training on the acquired synthetic data, REACTIE maintains consistency with the downstream objectives. Experimentally, REACTIE achieves state-of-the-art performance, improving F1 scores by 14.9 and 2.9 on the two subtasks, respectively. Moreover, we conduct ablation studies to examine the contributions of the proposed methods. Fine-grained analyses are performed to investigate the effects of the pre-training strategies on different reaction roles. Our findings suggest that linguistic cues are crucial for extracting products and numbers, while chemical knowledge plays an essential role in understanding catalysts, reactants, and reaction types.

Task Formulation
Given a text D, the goal of this task is to extract all the structured chemical reactions S in D, where each S ∈ S contains n role-argument pairs {(r_1, a_1), …, (r_n, a_n)}. The roles are 8 pre-defined attributes of a chemical reaction: product, reactant, catalyst, solvent, reaction type, temperature, yield, and time. Each S does not include the roles that are not present in the original text. Definitions for each role are included in Appendix A.
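For illustration, the structured output can be sketched as follows; this is a minimal sketch of the data format implied by the formulation above, not the authors' implementation:

```python
# Each structured chemical reaction S is a set of role-argument pairs,
# and roles absent from the source text are simply omitted.

ROLES = {"product", "reactant", "catalyst", "solvent",
         "reaction type", "temperature", "yield", "time"}

def make_reaction(pairs):
    """Build one structured reaction S from (role, argument) tuples."""
    for role, _ in pairs:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
    return dict(pairs)

# Example based on one reaction from Figure 1:
s = make_reaction([("product", "5e"), ("reactant", "7e"),
                   ("catalyst", "chloranil"), ("yield", "6%")])
```

Roles not mentioned in the text (here, e.g., solvent) are absent from the resulting dictionary rather than stored as empty values.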

Workflow for IE System
From the perspective of the model, existing systems typically follow a two-step pipeline: 1) Product Extraction: In chemical reactions, the product is the central factor, as the same reactants can yield varying products depending on the reaction conditions. Therefore, the IE systems first extract all the products in D to determine the number of chemical reactions, i.e., the number of S. This step can also be used to extract the passages in a scientific paper that contain chemical reactions.
2) Role Extraction: Given the original text D and the specific product, the IE systems are required to capture the relationship between the entities in D and the product, extract the corresponding reaction roles, and output the final S.
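The two-step pipeline above can be sketched schematically as follows; the extractor functions are caller-supplied stand-ins, since the paper does not describe the models at this level of detail:

```python
# Schematic two-step pipeline: step 1 enumerates the products, which
# determines the number of reactions; step 2 fills in the remaining
# role-argument pairs for each product.

def extract_reactions(text, product_extractor, role_extractor):
    products = product_extractor(text)                  # step 1: one S per product
    return [role_extractor(text, p) for p in products]  # step 2: role extraction
```

Conditioning step 2 on a specific product is what allows the system to disentangle several reactions described in the same passage.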

Reformulation
Previous studies have defined this task as a sequence labeling problem. However, this approach can be inadequate in certain cases. For instance, the final argument may be an alias, abbreviation, or pronoun of a compound in D, or a word may need to be converted (as illustrated in Figure 1, "oxidized" → "oxidation").
In light of these limitations, we reformulate the chemical reaction extraction task as a Question Answering (QA) problem, utilizing the pre-trained generation model FLAN-T5 (Chung et al., 2022) as the backbone. For product extraction, the input question is "What are the products of the chemical reactions in the text?". For role extraction, such as the catalyst, the corresponding question is "If the final product is X, what is the catalyst for this chemical reaction?". In this unified QA format, we present the pre-training stage of REACTIE as follows.
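The QA reformulation can be sketched as follows. The two question templates are quoted from the paper; the concatenation of question and passage in `build_input` is our own assumption about the input format:

```python
# Question templates for the unified QA format.

PRODUCT_QUESTION = "What are the products of the chemical reactions in the text?"

def role_question(role, product):
    """Question template for extracting one reaction role, given a product."""
    return (f"If the final product is {product}, "
            f"what is the {role} for this chemical reaction?")

def build_input(question, text):
    # Assumed input format: the question followed by the passage D.
    return f"{question} Text: {text}"
```

Because the answers are generated rather than tagged, the model can output aliases, abbreviations, or converted word forms (e.g., "oxidation") that never appear verbatim as labeled spans.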

Pre-training for REACTIE
Given the clear discrepancy between prevalent pre-training tasks, such as language modeling, and the task of chemical reaction extraction, we propose two weakly supervised methods for constructing synthetic data to bridge this gap.
Linguistics-aware Data Construction Intuitively, it is possible for humans to infer certain properties of a chemical reaction even without any prior knowledge of chemistry. As an example, consider the sentence "Treatment of 13 with lithium benzyl oxide in THF afforded the dihydroxybenzyl ester 15" (Dushin and Danishefsky, 1992). We can identify that "13" and "lithium benzyl oxide" are the reactants and "dihydroxybenzyl ester 15" is the end product, without knowing any of the specific compounds involved. This can be achieved by utilizing linguistic cues, such as the semantics of phrases and the structure of sentences, to extract the arguments.
Inspired by this, we leverage frequent patterns (Jiang et al., 2017) in text that describes specific reaction roles as linguistic cues. Taking product extraction as an example, we first replace each chemical with a special token "[Chem]" using CHEMDATAEXTRACTOR (Swain and Cole, 2016) and manually create a set of seed patterns (Figure 2). By merging the seed patterns with the enriched patterns, we can iteratively repeat the process and collect reliable data containing multiple linguistic cues. More examples and details can be found in Appendix B and Table 4.
Knowledge-aware Data Construction In addition to utilizing linguistic cues, a deep understanding of chemical reactions and terminology is imperative for accurately extracting information from texts. This is exemplified by the case presented in Figure 1, in which the roles of compounds such as "chloranil", "FeCl3", and "CHCl3" as reactants, catalysts, or solvents cannot be inferred without prior knowledge. In light of this, we propose integrating domain knowledge into REACTIE through synthetic data derived from patent records.
The text within patent documents is typically well-formatted, allowing structured chemical reactions to be extracted with well-designed rules that incorporate multiple chemical principles and associated knowledge bases (Lowe, 2012). To this end, we adopt datasets extracted from the U.S. patent literature by OPSIN (Lowe, 2018) as our synthetic data. We focus on the 4 reaction roles (product, reactant, catalyst, and solvent) that are most relevant to chemistry knowledge.
Training Paradigm The methods outlined above enable the acquisition of a substantial amount of synthetic data. We then conduct pre-training by building upon the FLAN-T5 model in a text-to-text format. The input contains a question q_i specific to a reaction role r_i and the text D, and the output is the corresponding argument a_i or "None". After pre-training, the unsupervised version of REACTIE acquires the capability to extract structured chemical reactions. To further improve it, we also perform fine-tuning on an annotated dataset to obtain a supervised version of REACTIE. The annotated corpus is designed to evaluate two subtasks: product extraction and role extraction.
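Converting a synthetic record into text-to-text training pairs can be sketched as follows; the question template matches the one quoted earlier, while the input concatenation and record format are our assumptions:

```python
# Each reaction role r_i yields one (input, output) pair: the question
# q_i plus the text D on the input side, and the argument a_i, or the
# literal answer "None" when the role is absent, on the output side.

def to_training_pairs(text, reaction, roles):
    pairs = []
    for role in roles:
        q = (f"If the final product is {reaction['product']}, "
             f"what is the {role} for this chemical reaction?")
        a = reaction.get(role, "None")   # absent role -> literal "None"
        pairs.append((f"{q} Text: {text}", a))
    return pairs
```

Training on explicit "None" answers teaches the model to abstain when a role is not stated, which matches the task definition (each S omits absent roles).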
Implementation Details We use "google/flan-t5-large" as the backbone model in all experiments.
For linguistics-aware data construction, we perform 3 iterations on 18,894 chemical journal articles and end up with 92,371 paragraphs containing the linguistic cues of product, temperature, yield, and time. Other reaction roles are excluded because they do not have sufficient patterns to ensure the reliability of the data. For knowledge-aware data construction, excessively long (> 256 words) and short (< 8 words) texts, as well as samples where the arguments do not appear in the original text, are filtered out to yield 100,000 samples. We train REACTIE for 1 epoch with 0.1 label smoothing on a total of 192,371 samples. For both pre-training and fine-tuning, we set the batch size to 16 and the learning rate to 5e-5. All results are the performance of the checkpoints selected on the dev set.

Experimental Results
Results for Product Extraction The first part of Table 1 presents the results under the unsupervised setting. OPSIN performs poorly in the scientific paper domain due to its sensitivity to language usage. In contrast, REACTIE demonstrates superior extraction capabilities after pre-training and outperforms the fully supervised BiLSTM (w/ CRF).
Under the supervised setting, REACTIE attains state-of-the-art performance by a significant margin, achieving a 14.9-point increase in F1 score compared to CHEMBERT. While our backbone model, FLAN-T5, shows outstanding results, our proposed methods lead to further gains (85.5 ⇒ 91.1 F1). Ablation studies highlight the importance of linguistics-aware pre-training over in-domain knowledge in the product extraction subtask. This finding also supports the advantages of pre-trained language models (FLAN-T5) over domain-specific models (CHEMBERT), as writers tend to provide sufficient linguistic cues for the products of chemical reactions when describing them.

Results for Role Extraction
As listed in Table 2, REACTIE also beats the previous best model, CHEMRXNBERT, by 2.9 F1 points on the role extraction subtask. In comparison to the product, the accurate extraction of the other reaction roles from the original text necessitates a greater level of in-domain knowledge. Specifically, the model performance decreases slightly (81.6 ⇒ 80.6 F1) when linguistics-aware pre-training is removed, and substantially, by 4.4 points (81.6 ⇒ 77.2 F1), when knowledge-aware pre-training is no longer incorporated. The results of these two subtasks reveal that our proposed approaches are complementary and indispensable in enabling REACTIE to fully comprehend chemical reactions. Together, they contribute to a deeper understanding of the task from both the linguistic and the chemical knowledge perspectives.
Analysis for Reaction Roles To further investigate the effect of our pre-training strategies, we present the ∆F1 scores on different reaction roles after equipping the two methods separately in Figure 3. We observe that the two strategies assist the model by concentrating on distinct aspects of chemical reactions. Linguistics-aware pre-training primarily improves performance on reaction roles related to numbers, as these numbers tend to appear in fixed meta-patterns. In contrast, knowledge-related pre-training significantly enhances the results for catalyst and reaction type, which require a chemical background for accurate identification. Overall, the combination of both approaches contributes to the exceptional performance of REACTIE in the chemical reaction extraction task.

Conclusion
In this paper, we present REACTIE, an automatic framework for extracting chemical reactions from the scientific literature. Our approach incorporates linguistic and chemical knowledge into pre-training. Experiments show that REACTIE achieves state-of-the-art results by a large margin.

Limitations
We state the limitations of this paper from the following three aspects: 1) Regarding linguistics-aware data construction, we only perform seed-guided pattern enrichment for four reaction roles (product, yield, temperature, and time; see Table 4) due to the lack of sufficient reliable patterns for the other roles. Incorporating more advanced pattern mining methods (Li et al., 2018; Chen et al., 2022) may alleviate this issue and discover more reliable linguistic cues, which we leave for future work.
2) As in previous work, we adopt a fixed reaction scheme to extract structured chemical reaction information. However, new informative roles keep appearing in text (Jiao et al., 2022), such as experimental procedures (Vaucher et al., 2021), so predicting both roles and arguments without being limited to a fixed scheme could be a meaningful research topic.
3) REACTIE is capable of detecting chemical reactions within the scientific literature by predicting whether a given passage contains a product. However, accurate text segmentation of a paper remains an unresolved and crucial issue. Incomplete segmentation may result in the failure to fully extract the reaction roles, while excessively long segments may negatively impact the model performance. Therefore, integrating a text segmentation module into the existing two-step pipeline may be the next step for the chemical reaction extraction task.

A Reaction Scheme
We adopt the same reaction scheme as the previous study, including 8 pre-defined reaction roles that cover the source chemicals, the outcome, and the conditions of a chemical reaction. To help better understand each reaction role, we include the detailed descriptions of the reaction scheme from Guo et al. (2022) as a reference in Table 3.
B Pattern Enrichment in Linguistics-aware Data Construction

Figure 1: An example of the chemical reaction extraction task. This figure depicts two of the four chemical reactions present in the text for simplicity. The passage is drawn from Ahmad et al. (2015).

Figure 2: Overview of REACTIE. We propose linguistics-aware and knowledge-aware methods to construct synthetic data, thus bridging the gap between the objectives of pre-training and the chemical reaction extraction task.
Concretely, starting from the [Chem]-tagged corpus, we manually create a set of seed patterns, such as "the produced [Chem]" and "conversion of [Chem] to [Chem]", where the highlighted [Chem] indicates that the chemical is the product of a reaction. As shown in Figure 2, based on the seed patterns and a chemistry corpus, we construct synthetic data as follows: 1) the seed patterns are used to annotate the chemical corpus, resulting in labeled training data; 2) FLAN-T5 is further trained in QA format on the data from the previous step; 3) the QA model is used to re-label the entire corpus; and 4) the most frequent patterns are mined from the data in step 3 as the enriched pattern set.
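A toy sketch of this bootstrapping loop follows. Here a "pattern" is a regex over [Chem]-tagged text, and the QA-model retraining and re-labeling steps are collapsed into a caller-supplied `miner` stub for brevity; the real pipeline trains FLAN-T5 at each iteration:

```python
import re

# Seed patterns marking the [Chem] token as a product (illustrative).
SEED_PATTERNS = [r"to obtain \[Chem\]", r"the produced \[Chem\]"]

def annotate(corpus, patterns):
    """Steps 1/3: keep the sentences matched by any known pattern."""
    return [s for s in corpus if any(re.search(p, s) for p in patterns)]

def enrich(corpus, patterns, miner, iterations=3):
    """Iteratively merge newly mined frequent patterns into the pattern set."""
    patterns = list(patterns)
    for _ in range(iterations):
        labeled = annotate(corpus, patterns)
        # In the paper, FLAN-T5 is retrained on `labeled` and re-labels
        # the corpus; `miner` stands in for that step here.
        patterns = sorted(set(patterns) | set(miner(labeled)))
    return patterns
```

The QA model generalizes beyond the literal seed patterns, which is what lets each iteration surface genuinely new cues rather than rediscovering the seeds.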

Figure 3: The impact of the two pre-training strategies on different chemical reaction roles. The Y-axis shows the F1 improvement compared to the backbone model.

Table 1: Results for product extraction. The results presented on the gray background correspond to the performance of REACTIE and its ablation studies.

Table 2: Results for role extraction.

Table 3: Reaction scheme used in this paper.
Table 4 provides examples of seed and enriched patterns for product, yield, temperature, and time. In each iteration, we extract n-grams (n = 2, …, 6) containing the product ([Chem]), yield ([Num]), temperature ([Num]), and time ([Num]) from the corpus re-labeled by the QA model and remove the redundant patterns. We then manually review and select the reliable patterns and merge them into the pattern set of the previous iteration.
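The n-gram mining step can be sketched as follows; ranking by raw frequency before manual review is our reading of the procedure above:

```python
# Collect all n-grams (n = 2..6) containing a placeholder token from the
# re-labeled corpus and rank them by frequency; the top patterns are then
# manually reviewed before being merged into the pattern set.

from collections import Counter

def mine_patterns(sentences, placeholders=("[Chem]", "[Num]")):
    counts = Counter()
    for sent in sentences:
        toks = sent.split()
        for n in range(2, 7):
            for i in range(len(toks) - n + 1):
                gram = toks[i:i + n]
                if any(p in gram for p in placeholders):
                    counts[" ".join(gram)] += 1
    return counts.most_common()
```

Restricting to n-grams that contain a placeholder keeps the mined patterns anchored to the argument being extracted, rather than to generic phrases.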