Adaptive Ordered Information Extraction with Deep Reinforcement Learning

Information extraction (IE) has been studied extensively. Existing methods always follow a fixed extraction order for complex IE tasks with multiple elements to be extracted in one instance, such as event extraction. However, we conduct experiments on several complex IE datasets and observe that different extraction orders significantly affect the extraction results for a large portion of instances, and that the ratio of order-sensitive sentences increases dramatically with the complexity of the IE task. Therefore, this paper proposes a novel adaptive ordered IE paradigm that finds the optimal element extraction order for each instance, so as to achieve the best extraction results. We also propose a reinforcement learning (RL) based framework to dynamically generate an optimal extraction order for each instance. Additionally, we propose a co-training framework adapted to RL to mitigate exposure bias during the extractor training phase. Extensive experiments conducted on several public datasets demonstrate that our proposed method outperforms previous methods and effectively improves the performance of various IE tasks, especially complex ones.


Introduction
Information Extraction (IE) has been studied extensively over the past few decades (Grishman, 2019). With the rapid development of pre-trained language models, simple IE tasks such as named entity recognition (Nadeau and Sekine, 2007; Li et al., 2020a) have been well solved. However, complex IE tasks with multiple elements to be extracted, such as relation extraction (Pawar et al., 2017) and event extraction (Hogenboom et al., 2011), still need further exploration.
Traditional IE methods always follow a static extraction manner, i.e., with a pre-defined fixed element order. For instance, in the relation extraction task, (Xie et al., 2021) recognizes the relation and then extracts the subject and object independently; (Luan et al., 2019) extracts all entities together and recognizes the relations between them pair-wise; (Wang et al., 2020) formulates joint extraction as a token-pair linking problem, which follows an implicit pre-defined order; and (Lu et al., 2022) designs a unified structured extraction language to encode different IE structures, where the generation order defines the extraction order. These extraction-based IE methods assume that the multiple elements of the same instance are independent of each other and thus do not affect each other's extraction results. However, we find that different extraction orders strongly affect the extraction results. For example, for instance A in Fig. 1, if we first extract the Bassoon Concerto in B-flat major, the succeeding element extraction becomes difficult because the entity is long-tail. Instead, if we extract Mozart first, it is much easier to extract the concerto, because Mozart appears frequently in the corpus and composed many concertos. According to our observation experiments conducted on several popular relation extraction and event extraction datasets, as listed in Table 1, a considerable portion of instances are sensitive to the extraction order.

Dataset                            #Instances  #Sensitive  Ratio
NYT (Riedel et al., 2010)          5000        476          9.52%
NYT10-HRL (Takanobu et al., 2019)  4006        525         13.11%
DuIE (Li et al., 2019)             15000       3618        24.12%
DuEE (Li et al., 2020b)            1492        605         40.55%
HacRED (Cheng et al., 2021)        1500        971         64.73%

Based on the observations above, the static extraction paradigm following a pre-defined extraction order is insufficient to achieve reliable extraction. This motivates us to propose a dynamic extraction paradigm, which aims to assign an optimal element extraction order to each instance adaptively. It is nontrivial to dynamically find the optimal extraction order for every instance in a dataset. On one hand, the optimal extraction order of an instance depends on not only its schema (relation or event type) but also the context of the sentence. On the other hand, multiple rounds of decisions are required to generate the optimal extraction order, where the decision at each step depends on not only the schema and sentence context but also the extraction results of previous steps.

† Corresponding authors.
1 Resources of this paper can be found at https://github.com/EZ-hwh/AutoExtraction
In this paper, we propose to adopt value-based reinforcement learning (Mnih et al., 2015) to determine the optimal extraction order for the elements of an instance. In particular, when deciding the next element to extract for an instance, each of its unextracted elements is evaluated with a potential benefit score, which is calculated with a BERT-based model. Then, the one with the highest potential benefit score is selected as the next extraction object. In addition, to mitigate the exposure bias that emerges during the RL-based extractor training phase, we adopt a co-training framework that simulates the inference environment of the extraction order decision agent to help enhance its performance. It is worth mentioning that our method focuses on generating the optimal extraction order, which is model agnostic and can be applied to various extraction paradigms.
Our main contributions are summarized below: • First, we identify the extraction order assignment problem in complex IE tasks and show that the extraction order significantly affects the extractor and the extraction results.
• Second, we propose an RL based framework that dynamically generates an optimal extraction order for each sentence, which is model agnostic and can effectively guide the model towards better extraction performance.
• Third, we adopt a co-training framework for RL to alleviate the exposure bias from the extractor, which can simulate the inference environment of the extraction order decision agent to help enhance its performance.
• Fourth, the experiments conducted on several public datasets show that our method outperforms the state-of-the-art extraction models.

Related Work
Pipeline Information Extraction (IE) methods split the extraction process into several sub-tasks and optimize each of them. They rely on the task definition, so the framework varies across different IE tasks.
For the relation extraction task, (Wei et al., 2020; Xie et al., 2021; Li et al., 2021) gradually extract the subject, relation, and object from the sentence in different orders. For the event extraction task, (Yang et al., 2018; Sheng et al., 2021; Yang et al., 2019) first recognize the event type and trigger and then extract the arguments with a sequence tagging model or a machine reading comprehension model. Joint IE methods combine two or more extraction processes into one stage. Graph-based methods are the mainstream joint IE framework. They recognize entities or text spans and build a graph with co-reference, relations (Wadden et al., 2019; Luan et al., 2019), entity similarity (Xu et al., 2021), or sentence co-occurrence (Zhu et al., 2021). Through information propagation on the graph, they better encode the sentence or document and then decode the edges to build the final sub-graph. Generation-based IE methods are another paradigm for joint extraction: (Cabot and Navigli, 2021; Ye et al., 2021) for the relation extraction task, (Zheng et al., 2019; Hsu et al., 2022; Du et al., 2021) for event extraction, and (Lu et al., 2022) for unified information extraction; they all serialize the structured extraction results into a sentence or a pre-defined template in a fixed order. Apart from the works above, (Wang et al., 2020; Shang et al., 2022) propose one-stage joint extraction frameworks that decode the subject and object simultaneously.
Recently, reinforcement learning (RL) has been applied to IE tasks. (Narasimhan et al., 2016) augment information extraction by using RL to acquire and incorporate external evidence. (Feng et al., 2018; Wang, 2020) both train an RL agent for instance selection to denoise training data obtained via distant supervision for relation extraction. (Takanobu et al., 2019) utilizes a hierarchical reinforcement learning framework to improve the connection between entity mentions and relation types. (Zeng et al., 2019) first considers the extraction order of relational facts in a sentence and trains a sequence-to-sequence model with RL.

Task Definition
Fig. 2 gives an example of a complex information extraction process, which first recognizes the schema and then extracts the argument arg_i for the corresponding role role_i. Generally speaking, the task can be split into two sub-tasks: relation (event) detection and entity (argument) extraction. We formulate the second sub-task as a multi-argument extraction task. Given an instance s, the relation/event type rel, and the corresponding pre-defined schema <rel, role_1, role_2, ..., role_n>, our goal is to find all the arguments in s and fill them into their corresponding roles in the schema.
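As a concrete illustration, the schema and instance above can be represented with simple data structures. This is only a sketch; the class and field names are our own, not taken from the paper's released code.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    """A pre-defined schema <rel, role_1, role_2, ..., role_n>."""
    rel: str                 # relation / event type
    roles: list              # role names role_1 ... role_n

@dataclass
class Instance:
    """One multi-argument extraction instance."""
    sentence: str
    schema: Schema
    arguments: dict = field(default_factory=dict)  # role -> argument span

    def unfilled_roles(self):
        """Roles in the schema that still need an argument."""
        return [r for r in self.schema.roles if r not in self.arguments]

# hypothetical example instance
schema = Schema(rel="Product_release", roles=["Agent", "Product", "Time"])
inst = Instance("ACME released its new widget in 2020.", schema)
inst.arguments["Time"] = "2020"
```

The `unfilled_roles` view is exactly the shrinking action space used by the RL agent later in the paper.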

Solution Framework
In this work, we model complex information extraction as a multi-step argument extraction task. In this setting, only one role in the schema is extracted from the instance at a time. With the help of an extractor that can extract arguments given the additional information and the role name, we extract all arguments and fill the roles step by step.
Though there are various roles in a complex IE task, the difficulties of extracting them are completely different. For example, the role Reference_point_time in Fig. 2 indicates the time in the context, and it can be extracted without further information. Other roles like Supplier_Consumer, however, cannot be identified from the role name alone, so they should be scheduled for extraction later. We hope that the extractor first extracts the simplest role, then the next simplest one, and so on. By incrementally adding the previously extracted information, the extractor can maintain good performance when extracting the difficult roles.
To achieve this goal, we need to arrange a reasonable extraction order for the extractor. However, it is hard to specify the whole extraction order at once, because it depends on not only the schema and context but also the previously extracted arguments. We therefore regard the extraction order decision as a Markov decision process, in which we dynamically decide the extraction order over multiple rounds. Reinforcement learning is a natural way to handle this modeling. We adopt a double deep Q-network (DQN) with a prioritized replay buffer as the RL agent.

Framework

4.1 Extractor
To handle extraction tasks with different extraction orders, we need a powerful extractor. GlobalPointer, proposed by (Su et al., 2022), is an ideal choice, as it can identify both nested and non-nested entities and even output the scores of the entities.
We first construct a sequence consisting of the extracted elements, the role name, and the sentence. For an input sequence with N tokens, the BERT-based encoder outputs a vector sequence [h_1, h_2, ..., h_N]. Following the computation of the attention matrix, we use one-head self-attention to compute the output matrix of the decoder. More specifically, we first convert the vectors h_i to vectors q_i and k_i with two linear layers:

q_i = W_q h_i + b_q,    k_i = W_k h_i + b_k,

where W and b are the parameters of the linear layers.
Then we compute the score of each span with rotary position embeddings (RoPE) (Su et al., 2021):

s_α(i, j) = (R_i q_i)^⊤ (R_j k_j) = q_i^⊤ R_{j−i} k_j,

where the transformation matrix R_i, composed of sine and cosine functions, satisfies the property R_i^⊤ R_j = R_{j−i}. By introducing relative position embeddings, GlobalPointer is more sensitive to the length and span of entities, so it can better distinguish real entities.
To mitigate the label imbalance problem, we train our extraction model with the following loss:

L = log(1 + Σ_{(i,j)∈P_α} e^{−s_α(i,j)}) + log(1 + Σ_{(i,j)∈Q_α} e^{s_α(i,j)}),

where P_α is the head-tail set of all spans of query α, and Q_α is the head-tail set of the remaining valid spans in the text. In the decoding phase, all spans t_{[i:j]} with s_α(i, j) > 0 are considered entities that match the conditions.
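A minimal numpy sketch of this scoring-and-decoding scheme follows; the function names are illustrative, and the loss implements the formula above on a toy score matrix.

```python
import numpy as np

def span_loss(scores, pos_spans, valid_spans):
    """GlobalPointer-style loss for one query alpha (a sketch).
    scores[i, j] = s_alpha(i, j); spans are (head, tail) pairs;
    pos_spans plays the role of P_alpha, the rest of valid_spans Q_alpha."""
    pos = np.array([scores[i, j] for (i, j) in pos_spans])
    neg = np.array([scores[i, j] for (i, j) in valid_spans
                    if (i, j) not in pos_spans])
    return (np.log1p(np.exp(-pos).sum())      # log(1 + sum e^{-s_pos})
            + np.log1p(np.exp(neg).sum()))    # log(1 + sum e^{+s_neg})

def decode(scores, n):
    """All spans with s_alpha(i, j) > 0 are predicted as entities."""
    return [(i, j) for i in range(n) for j in range(i, n) if scores[i, j] > 0]

# toy 3-token sentence with one confidently scored span
scores = np.full((3, 3), -5.0)
scores[1, 2] = 5.0
valid = [(i, j) for i in range(3) for j in range(i, 3)]
```

Decoding the toy matrix yields the single span (1, 2), and the loss is near zero because the positive span is scored high while all negatives are low.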
To better fit the setting of extraction tasks, which extract entities under the conditions of the schema and the role name, we enumerate all extraction orders and match the corresponding conditions with the extraction results to construct training instances.
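The enumeration of extraction orders can be sketched as follows; this is a simplified illustration, and the dictionary field names are our own rather than the paper's.

```python
from itertools import permutations

def build_training_instances(sentence, rel, gold_args):
    """Enumerate every extraction order over the gold role->argument map
    and emit one conditional training instance per extraction step."""
    instances = []
    for order in permutations(gold_args):
        for t, role in enumerate(order):
            conditions = {r: gold_args[r] for r in order[:t]}  # already extracted
            instances.append({
                "sentence": sentence, "rel": rel,
                "conditions": conditions,     # extracted elements so far
                "role": role,                 # role to extract next
                "answer": gold_args[role],
            })
    return instances
```

With n roles this yields n! · n training instances per tuple, which stays feasible only for the small n typical of IE schemas.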

MDP for extraction order
We regard the multi-role extraction order decision process as a Markov decision process (MDP). Fig. 3 shows the whole extraction process for an instance. In each step, the agent takes the instance and the extracted arguments as input and chooses a previously unselected role as the action. The environment takes the selected role and constructs the input sequence for the extractor. After collecting the extraction results to fill the role and the extraction scores to assign the reward, the environment transits to the new state(s). After all roles have been selected over multiple rounds, we convert the whole extraction history into the structured output.

State
We use s_t to denote the state of sentence x at extraction time step t. The state s_t consists of the extraction schema S, the already extracted arguments ŷ_{<t}, and the sentence x.
The state describes the elements extracted in past steps. In each step, the environment takes the role selected by the agent and extracts the corresponding arguments from the sentence with the help of the extractor described in Section 4.1.
Action The action of the RL model is the next role to extract in an instance. Unlike traditional RL environments, the action space in our model shrinks at every time step. We restrict the initial action space A_0 to the set of roles in the schema S. After role a_0 is selected at time step 0, the extractor in the environment extracts the argument and its confidence score from s. The action a_0 is then removed from A_0 to derive the next action space A_1.
The derivation of the action space can be formalized as:

A_0 = {role_1, role_2, ..., role_n},    A_{t+1} = A_t \ {a_t}.
Reward The reward is a critical component of RL training; it is the incentive mechanism that encourages the agent to plan better for the episode. For our task, there is a simple reward assignment: extract all the arguments in the sentence, and then assign a reward in the terminal state indicating whether the extracted tuple matches the gold label. But there is an obvious issue: this reward largely depends on the extractor we use. If the extractor is too strong, the results extracted under any extraction order are correct; if the extractor is too weak, the results extracted under any extraction order are incorrect. In both cases, different extraction orders cannot affect the final reward. Therefore, to better distinguish the impact of different extraction orders on the extractor, we define the reward as the score that the extractor assigns to the extraction results. Though the result recognized by the extractor in a single step is the same, the score differs under different conditions, as described in Section 4.1. We regard the score as the difficulty of extracting the argument from the sentence: an extracted argument with a high score is easy to extract. The reward of our RL environment is

R(s, a) = Extractor_score(a | s),

where s stands for the state and a stands for the role chosen by the agent to be extracted next.
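Putting the action-space shrinkage and the score-based reward together, one environment transition can be sketched as below. The `extractor` callback and the state-dict layout are illustrative assumptions, not the paper's actual interface.

```python
def env_step(state, action, extractor):
    """One MDP transition (sketch). `extractor(state, role)` is assumed to
    return a list of (argument, score) candidates for the chosen role; each
    candidate forks a separate successor state, and the extractor score
    serves as the reward R(s, a)."""
    assert action in state["actions"]
    next_actions = state["actions"] - {action}        # A_{t+1} = A_t \ {a_t}
    successors = []
    for argument, score in extractor(state, action):
        nxt = {"extracted": {**state["extracted"], action: argument},
               "actions": next_actions}
        successors.append((nxt, score))
    return successors

# toy extractor that always proposes two candidates with fixed scores
toy_extractor = lambda state, role: [("arg1", 0.9), ("arg2", 0.4)]
s0 = {"extracted": {}, "actions": {"Agent", "Time"}}
branches = env_step(s0, "Time", toy_extractor)
```

Each element of `branches` is one of the split successor states discussed above, paired with its reward.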

Double DQN
For the traditional DQN (Mnih et al., 2013), the learning objective follows the Bellman equation:

y = r + γ max_{a'} Q(s', a'; θ⁻).

In our task setting, the agent chooses an unextracted role and the environment returns the extracted argument with the corresponding score. Since the extractor extracts all entities that meet the conditions at once, it may extract zero to multiple answers in one step, and each extracted result forms a separate state. Due to this splitting of the state, we need to adjust the original learning objective accordingly. Inspired by (Tavakoli et al., 2018), we introduce a new learning objective adapted to the branching reinforcement learning environment by replacing the Q value of the next state with the average over the list of next states:

y = r + γ · (1 / |S_{(s,a)}|) Σ_{s'∈S_{(s,a)}} max_{a'} Q(s', a'; θ⁻),    (8)

where S_{(s,a)} is the set of successor states derived from state s with action a, and γ is the discount factor.
To avoid suffering from the overestimation of action values, we adopt the Double DQN (DDQN) algorithm (Van Hasselt et al., 2016), which uses the current Q-network to select the next greedy action but evaluates it using the target network. Meanwhile, to enable more efficient learning from experience transitions, we adopt a prioritized experience replay buffer (Schaul et al., 2015) to replay important transitions.
We define the loss as the expected mean squared TD error:

L(θ) = E_{(s,a,r,s')∼D} [(y − Q(s, a; θ))²],

where D denotes the prioritized experience replay buffer and y denotes the value estimated with the target network.
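A sketch of the branching Double-DQN target described above; the Q-networks are passed in as plain callables, which simplifies away the full training loop of Algorithm 1.

```python
def ddqn_target(reward, next_states, q_online, q_target, gamma=0.5):
    """Branching Double-DQN target: average over the list of successor
    states; the greedy next action comes from the online network, its
    value from the target network."""
    if not next_states:                       # terminal transition
        return reward
    values = []
    for s in next_states:
        a_star = max(s["actions"], key=lambda a: q_online(s, a))
        values.append(q_target(s, a_star))
    return reward + gamma * sum(values) / len(values)

def td_loss(y, q_sa):
    """Squared TD error for one sampled transition."""
    return (y - q_sa) ** 2
```

With the default gamma = 0.5 this matches the discount factor reported in the paper's implementation details.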
To evaluate the value of a (state, action) pair, we use a Transformer-based model as the encoder, which takes the state and action as input and encodes the pair. Specifically, we use BERT (Devlin et al., 2018) for English and RoBERTa (Liu et al., 2019) for Chinese. To evaluate the Q(s, a) value for the corresponding state and action, we take h_0, the encoded vector of the first token [CLS], as the representation of the state-action pair. The final output of the value evaluation module is

ŷ = Q(s, a) = W h_0 + b,

where W and b are trainable model parameters, representing the weights and bias of the linear transformation.
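The input construction and the linear value head can be sketched as follows; tokenization and the BERT forward pass are omitted, and `h0` stands for the [CLS] vector.

```python
import numpy as np

def build_input(action_tokens, state_tokens):
    """Form x = [[CLS], a_t, [SEP], s_t, [SEP]] for the value network."""
    return ["[CLS]", *action_tokens, "[SEP]", *state_tokens, "[SEP]"]

def q_head(h0, W, b):
    """y_hat = Q(s, a) = W h_0 + b on the [CLS] representation."""
    return float(W @ h0 + b)
```

In the real model h0 is the 768-dimensional BERT output for [CLS]; here a tiny vector suffices to show the arithmetic.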

Co-training framework
Besides the extraction order decided by our agent, the extractor in the environment also affects the final extraction result. The argument extracted in one step affects not only the extracted tuple but also the decisions of the agent. However, there is a big difference between the training phase and the inference phase. In the training phase, the agent explores an environment in which the extractor performs well and extracts arguments with high scores. In the inference phase, however, the capacity of the extractor is reduced because of the shift from the training data to the test data. To ensure that the agent also works in the inference phase, we need to keep the training environment similar to the inference environment.
We propose a co-training framework, shown in Fig. 4, to simulate the environment of the testing phase. We first split the training set into two pieces, which are used to train two sub-extractors. Then we cross over the sub-datasets and sub-extractors to build two training environments. In each environment, the extractor is trained on the other piece of the training set and extracts arguments from sentences it has never seen. We train two agents in the two environments separately. By introducing the co-training framework, we keep the training and inference environments consistent. In the inference phase, we build the environment with the test set and the extractor trained on the whole training set. By combining the decisions the two agents make, the arguments in the sentences are extracted step by step.
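The crossover step can be sketched as below; the `train_extractor` routine is passed in as a callable, and all names are illustrative.

```python
def co_training_environments(train_set, train_extractor):
    """Split the training data in half, train one sub-extractor per half,
    then cross over: each training environment pairs one half of the data
    with the extractor trained on the *other* half, so the extractor never
    saw the sentences it must extract from."""
    half = len(train_set) // 2
    d1, d2 = train_set[:half], train_set[half:]
    e1, e2 = train_extractor(d1), train_extractor(d2)
    return [(d1, e2), (d2, e1)]
```

Each returned (data, extractor) pair is one of the two environments in which an RL agent is trained.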

Datasets
We evaluate our method on several public complex information extraction datasets, including NYT, NYT10-HRL, SKE21, HacRED, DuIE, and DuEE, which are challenging for many recent extraction methods. We give a brief introduction to these datasets in the appendix.
We adopt exact match: an extracted relational triple (subject, relation, object) is regarded as correct only if the relation and the spans of both subject and object are correct. We report the standard micro precision (Prec.), recall (Reca.), and F1 scores for the relation extraction experiments. For the event extraction task (only DuEE in our experiments), we report word-level metrics, which consider the correctness of the arguments at the word level. We give the details of this metric in the appendix.

Main Results
Because we only consider extraction order assignment within each instance, we add a classification module that first recognizes the relation in the sentence and then extracts the subject and object in the extraction order the RL agents assign. According to the results in Table 2, compared to mainstream relation extraction methods, our method achieves improvements in precision, recall, and F1 score.

Extraction order
To demonstrate the effectiveness of our method, we also conduct an experiment with different extraction order decision strategies on the more challenging event argument extraction task. In this experiment, we mainly focus on the performance of extraction under different extraction orders, so the ground-truth schema (relation) of every instance is provided beforehand, and we only test the correctness of the roles. We also provide the results of extracting arguments in a pre-defined order and in a random order as baselines. Table 3 shows the results of different extraction orders on different datasets, and our method achieves the best results on every dataset. Compared to the standard relation extraction task, our method performs better on complex information extraction tasks (DuIE and DuEE).

Complicated Extraction settings
To further demonstrate the advantage of dynamic extraction order decision by RL, we conduct experiments with more complex extraction settings, which contain more tuples or more arguments. For the former, we limit the minimum number of extracted tuples; for the latter, we limit the minimum number of extracted roles. Tables 4 and 5 show that, compared to fixed-order or random-order extraction, our framework yields a more significant improvement over the original metrics. This is intuitive and reasonable: the extractor is more sensitive to the extraction order in more complex sentences. Besides, comparing Tables 4 and 5, we find that our method improves the latter setting more significantly. This is because increasing the number of tuples does not lengthen the extraction path but only increases the difficulty of single-step extraction, whereas increasing the number of roles lengthens the extraction path, which makes the extraction order decision more difficult. The results once again prove that the extraction order matters in complex extraction.

Case Study
By inspecting the RL agent, we can easily observe the extraction order for different instances. Table 6 shows the extraction processes of two instances. Though the two instances share the same event schema Product release, the RL agent assigns different extraction orders dynamically. The first sentence contains an obvious time element, while the second does not, so our method moves the extraction of time from the first step to the last. This case clearly demonstrates the effectiveness of our method.

Conclusion
We trained 5 epochs on the standard relation extraction task and 10 epochs for DuIE and DuEE. Additionally, the exploration parameter ϵ was initialized at 0.9 and the discount factor γ was set to 0.5 in the DQN framework. To encourage continuous exploration, the exploration rate was multiplied by 0.9 every fixed number of steps until it reached 0.05. The necessary number of decay events is

N = ⌈log_{0.9}(0.05 / 0.9)⌉ = 28.
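The decay-event count implied by these hyper-parameters (start 0.9, floor 0.05, factor 0.9) can be checked numerically:

```python
import math

def decay_events_needed(eps0=0.9, eps_min=0.05, factor=0.9):
    """Smallest n with eps0 * factor**n <= eps_min, i.e. how many
    multiplicative decays until exploration reaches its floor."""
    return math.ceil(math.log(eps_min / eps0) / math.log(factor))
```

With the paper's settings this gives 28 decay events.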

C Word-level Metric
We provide the word-level precision, recall, and F1 score calculation formulas. First we calculate the F1 score of a single argument:

TextF1(arg_p, arg_g) = 2 |Set_p ∩ Set_g| / (|Set_p| + |Set_g|),

where Set_p denotes the set of words in the predicted argument and Set_g denotes the set of words in the ground-truth argument. We set TextF1 = 1 if both Set_p and Set_g are empty. For every predicted event E_p and ground-truth event E_g, we calculate their match score as their mean TextF1 score.
Score(E_p, E_g) = (1/N) Σ_{i=1}^{N} TextF1(arg_p^i, arg_g^i),    (15)

where N denotes the number of roles that the event schema contains.
We separately calculate the precision for predicted events and the recall for ground-truth events:

Prec_E(E_p) = max_{e∈E_g} Score(E_p, e),
Reca_E(E_g) = max_{e∈E_p} Score(e, E_g),    (16)

where E_p and E_g are the predicted and gold event sets of the same instance. Finally, we take the mean precision and recall over all predicted and gold annotations in the whole test set as our Prec. and Reca.
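These formulas translate directly into code. In this sketch an event is represented as a role -> word-list dict, which is our own simplification.

```python
def text_f1(pred_words, gold_words):
    """Word-level F1 between one predicted and one gold argument;
    defined as 1 when both word sets are empty."""
    sp, sg = set(pred_words), set(gold_words)
    if not sp and not sg:
        return 1.0
    overlap = len(sp & sg)
    if overlap == 0:
        return 0.0
    prec, reca = overlap / len(sp), overlap / len(sg)
    return 2 * prec * reca / (prec + reca)

def event_score(pred_event, gold_event, roles):
    """Mean TextF1 over the N roles of the schema (Eq. 15)."""
    return sum(text_f1(pred_event.get(r, []), gold_event.get(r, []))
               for r in roles) / len(roles)
```

The max-matching of Eq. 16 then reduces to calling `event_score` for every (predicted, gold) event pair and taking the maximum per event.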

Figure 1 text: A: "The Bassoon Concerto in B-flat major is a bassoon concerto written in 1774 by Mozart." B: "Shape of my heart is a song composed by Sting and Dominic Miller."

Figure 1: An example of complicated information extraction with different extraction order.
As Table 1 shows, a significant proportion of sentences are sensitive to the extraction order, i.e., different extraction orders produce different extraction results, and the ratio of sensitive sentences increases dramatically with the complexity of the task across datasets. What is worse, the optimal extraction order may not be the same for different instances of the same relation or event type. For example, for the composer relation, we should extract the composer Mozart first in instance A, but extract the song Shape of my heart first in instance B in Fig. 1, because of its frequent appearance in the corpus.

Figure 2: An example of Complicated Information Extraction

Figure 4: Framework of co-training. Blue arrows represent the training data division process. Yellow arrows represent the extractor training process. Green arrows represent the RL agent training process.

Table 1: The statistics of sensitive instances on different datasets, where an instance is sensitive if different extraction orders produce different extraction results with the same model.
Algorithm 1: Training the Double DQN agent with ϵ-greedy exploration
Input: D, empty replay buffer; θ, initial network parameters; θ⁻, copy of θ; N_b, training batch size; N⁻, target-network replacement frequency
1:  for epoch = 1, ..., E do
2:    Sample an instance s and its schema S from the dataset
3:    N_step ← number of roles in S
4:    for t = 1, ..., N_step do
5:      p ← Random(0, 1)
6:      if p < 1 − ϵ then a_t ← argmax_a Q(s_t, a; θ)
7:      else a_t ← Random-Sample(A_t)
8:      s_{t+1}, r_t ← Transition(s_t, a_t)
9:      Store transition (s_t, a_t, r_t, s_{t+1}) in D
10:     Sample a random mini-batch of N_b transitions (s_t, a_t, r_t, s_{t+1}) from D
11:     if s_{t+1} = done then y_t = r_t
12:     else y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; θ⁻)
13:     Update parameters θ on the loss L(θ) = (y_t − Q(s_t, a_t; θ))²
14:     Replace target parameters θ⁻ ← θ every N⁻ steps

Formally, for an input state s_t = [t_1, t_2, ..., t_N] and action a_t = [a_1, ..., a_M], where the action is the candidate extraction role name, we form the sequence x = [[CLS], a_t, [SEP], s_t, [SEP]]. The BERT encoder converts these tokens into hidden vectors [h_1, h_2, ..., h_{M+N}], where h_i is a d-dimensional vector and d = 768 in the Transformer-based structure.

Table 3: Extraction results on different datasets with different extraction order decisions.

Table 4: Extraction results on complicated extraction cases with different extraction order decisions. HacRED and SKE21 are both tested on instances with at least 5 triples of the same relation.

Table 5: Extraction results on complicated extraction cases with different extraction order decisions. DuIE is restricted to instances with at least 3 roles and DuEE to instances with at least 5 roles.

Table 6: Instances of extracting a complicated schema by dynamically assigned extraction order.

Table 7: Hyper-parameters for training the extractor and the agent model, respectively.

NYT (Riedel et al., 2010) is the earliest version of the NYT-series datasets. It is based on articles from the New York Times and contains 66,194 texts and 24 relation types. SKE21: SKE2019 is the largest publicly available Chinese relation extraction dataset, published by Baidu, which contains 194,747 sentences for training. (Xie et al., 2021) manually labeled 1,150 sentences from the test set with 2,765 annotated triples.

Our experiments are conducted on a single RTX3090 GPU. All deep models, including the extraction model and the decision model, are implemented with the PyTorch framework. We initialize the models with bert-base-cased and chinese-roberta-wwm-ext respectively, training 10 epochs for both the extractor and the classifier. For the reinforcement learning module, we set the buffer size to 100,000 and the target network update interval to 200 steps.