Gradient Imitation Reinforcement Learning for Low Resource Relation Extraction

Low-resource Relation Extraction (LRE) aims to extract relation facts from limited labeled corpora when human annotation is scarce. Existing works either utilize a self-training scheme to generate pseudo labels, which causes the gradual drift problem, or leverage a meta-learning scheme, which does not solicit feedback explicitly. To alleviate the selection bias caused by the lack of feedback loops in existing LRE learning paradigms, we develop a Gradient Imitation Reinforcement Learning method that encourages pseudo-labeled data to imitate the gradient descent direction on labeled data and bootstraps its optimization capability through trial and error. We also propose a framework called GradLRE, which handles two major scenarios in low-resource relation extraction. Besides the scenario where unlabeled data is sufficient, GradLRE handles the situation where no unlabeled data is available by exploiting a contextualized augmentation method to generate data. Experimental results on two public datasets demonstrate the effectiveness of GradLRE on low-resource relation extraction compared with baselines.


Introduction
Relation Extraction (RE) aims to discover the semantic relation that holds between two entities and transforms massive corpora into structured triplets (entity$_{head}$, relation, entity$_{tail}$). For example, from "A letter$_{head}$ was delivered to my office$_{tail}$ ...", we can extract the relation Entity-Destination between the head and tail entities. Neural RE methods leverage high-quality annotated data or human-curated knowledge bases to achieve decent results (Zeng et al., 2017). However, such manually labeled data is labor-intensive to obtain. This motivates the Low Resource Relation Extraction (LRE) task, where annotations are scarce.
Figure 1: Gradient descent direction on labeled data ($g_l$) and on unlabeled data with a correct or incorrect pseudo label ($g_u$, $g'_u$).
Much effort has been devoted to improving model generalization beyond learning directly from existing, limited annotations. Distant Supervision methods leverage facts stored in external knowledge bases (KBs) to obtain annotated triplets as supervision (Mintz et al., 2009; Zeng et al., 2015). However, these methods make the strong assumption that two co-occurring entities convey KB relations regardless of the specific context, which makes the model generate relations based on contextless rules and limits its generalization ability. To leverage unlabeled data, Rosenberg et al. (2005) propose to assign pseudo labels to unlabeled data and use these pseudo labels to iteratively improve the generalization capability of the model. However, during training, self-training models suffer from the gradual drift problem (Curran et al., 2007; Zhang et al., 2016) caused by noisy pseudo labels. Hu et al. (2021) alleviate the noise in pseudo labels by adopting a meta-learning scheme during pseudo label generation, then using a pseudo label selection and exploitation scheme to obtain high-confidence pseudo labels. However, when limited annotations are used directly during training, the trained models inevitably possess a selection bias towards the limited labeled data, if they do not overfit on it outright, which impedes LRE models from generalizing beyond the annotations.
To improve the generalization ability for LRE, we propose to use existing annotations as a guideline instead of having them directly involved in training, and to introduce an explicit feedback loop when consuming annotations. More specifically, we first encourage pseudo-labeled data to imitate labeled data in the gradient descent direction during optimization. We illustrate this idea in Figure 1. $g_l$ represents the average gradient descent direction on labeled data. $g_u$ and $g'_u$ represent the gradient directions induced by correct and incorrect pseudo labels on unlabeled data, which guide the gradient descent direction in a positive or negative fashion (Du et al., 2018; Sariyildiz and Cinbis, 2019). Based on how well the pseudo-labeled data mimics the instructive gradient descent direction obtained from the limited labeled data, we then design a reward to quantify this behavior and use the reward as explicit feedback. This learnable setting can be naturally formulated as a reinforcement learning framework, which aims to learn an imitation policy that maximizes the reward through trial and error. Compared with methods where annotations are used directly in the traditional learning schema, this formulation also allows a feedback mechanism and thus increases generalization ability beyond the limited annotations. We name our method Gradient Imitation Reinforcement Learning. We propose a framework called GradLRE, which integrates Gradient Imitation Reinforcement Learning and is able to handle two major scenarios in LRE: 1) a typical scenario where limited labeled data and large amounts of unlabeled data are available, and 2) an extreme yet practical scenario where even unlabeled data is absent and only limited labeled data is available. GradLRE handles the former scenario via pseudo labeling optimized through Gradient Imitation Reinforcement Learning and tackles the latter scenario with a Contextualized Data Augmentation module.
To summarize, the main contributions of this work are as follows:
• We propose a gradient imitation reinforcement learning method that alleviates the bias from training directly with limited annotations, and encourages the RE model to effectively generalize beyond limited annotations.
• We develop an LRE framework, GradLRE, that handles two low-resource relation extraction scenarios by leveraging both Gradient Imitation Reinforcement Learning and Contextualized Data Augmentation.

Proposed Model
The proposed framework GradLRE consists of three modules: Relational Label Generator (RLG), Gradient Imitation Reinforcement Learning (GIRL) and Contextualized Data Augmentation (CDA). As illustrated in Figure 2, two low-resource relation extraction scenarios are handled. For the first scenario, where limited labeled data and large amounts of unlabeled data are available, the input of RLG is labeled data and unlabeled data. Labeled data consists of sentences and relation mentions: [Sentence, Entity$_1$, Entity$_2$, Relation]. For the second scenario, where only limited labeled data is available, we adopt CDA to generate unlabeled data and use it in the same way as in the first scenario. In a traditional self-training setting, we would fine-tune RLG directly on the labeled data and let RLG assign pseudo labels to unlabeled data to obtain pseudo-labeled data. However, we argue that such a learning paradigm suffers from selection bias due to the lack of feedback loops: the bias occurs when a model itself influences the generation of data that is later used for training. In this work, we complete the feedback loop and alleviate such bias by leveraging GIRL to learn a policy that maximizes the similarity between the expected gradient optimization direction from pseudo labels and the average gradient optimization direction on labeled data.

Relational Label Generator
The Relational Label Generator (RLG) aims to obtain contextualized relational features for each input sentence based on the entity pair, and to classify the entity pair into specific relations. In this work, we assume that named entities in the sentence have been recognized in advance.
For a sequence of words in a sentence $x$ where two entities E1 and E2 are mentioned, we follow the labeling schema adopted in Soares et al. (2019) and augment $x$ with four reserved tokens to mark the beginning and the end of each entity. We inject the reserved tokens into $x$ and obtain a contextualized relational representation for the entity pair, where $h_R$ is the contextualized relational representation length. The RLG then classifies these representations into specific relations with a fully connected network. We adopt this architecture to generate labels on sentences, and denote the RLG process as $f_\theta(x, E1, E2)$.
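The marker-injection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the marker strings `[E1]`, `[/E1]`, `[E2]`, `[/E2]` and the function name `mark_entities` are assumptions, since the text only states that four reserved tokens mark the beginning and end of each entity.

```python
from typing import List, Tuple

# Hypothetical marker strings (assumption): the paper only says "four reserved tokens".
E1_START, E1_END = "[E1]", "[/E1]"
E2_START, E2_END = "[E2]", "[/E2]"


def mark_entities(tokens: List[str],
                  e1: Tuple[int, int],
                  e2: Tuple[int, int]) -> List[str]:
    """Wrap two non-overlapping entity spans with reserved marker tokens.

    Spans are (start, end) word indices with an exclusive end.
    """
    starts = {e1[0]: E1_START, e2[0]: E2_START}
    ends = {e1[1]: E1_END, e2[1]: E2_END}
    out: List[str] = []
    for i, tok in enumerate(tokens):
        # Close a span that ended just before this word, then open a new one.
        if i in ends:
            out.append(ends[i])
        if i in starts:
            out.append(starts[i])
        out.append(tok)
    # A span may end at the very end of the sentence.
    if len(tokens) in ends:
        out.append(ends[len(tokens)])
    return out


sentence = "A letter was delivered to my office".split()
marked = mark_entities(sentence, e1=(1, 2), e2=(6, 7))
print(" ".join(marked))
```

In a real pipeline the marked sequence would then be tokenized and encoded by the contextualized encoder; that part is omitted here because it depends on the chosen model library.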

Gradient Imitation Reinforcement Learning
Generally, we assign pseudo labels to unlabeled data via RLG and add the selected pseudo-labeled data to the existing labeled data to iteratively improve RLG. We argue that without a feedback loop measuring the quality of pseudo labels, the model is more likely to suffer from selection bias, which impedes it from reaching a better generalization ability. We aim to generate pseudo labels with fewer labeling biases and errors, especially when annotations are scarce. To achieve this goal, we focus on improving RLG by introducing gradient imitation to define and quantify what an appealing behavior looks like. We define the partial derivatives of the loss function with respect to the RLG parameters on the labeled data as the standard gradient descent direction, and assume that when pseudo-labeled data are correctly labeled by RLG, the partial derivatives with respect to the RLG parameters on the pseudo-labeled data are highly similar to the standard gradient descent direction. Following this assumption, we propose Gradient Imitation Reinforcement Learning (GIRL), which optimizes RLG under a reinforcement learning framework (Williams, 1992). We now explain the reinforcement learning process in detail.
State: The state signals the optimization status. We use $s^{(t)}$ to denote the state, which consists of the updated labeled dataset $D_l$ at step $t$, along with the standard gradient direction $g_l$ at step $t$.
Policy: Our policy is learned to assign correct pseudo labels to unlabeled data. The policy network is parameterized by the RLG network $f_\theta$.
Action: The action is to predict a relational label for the unlabeled data $x^{(t)}$, yielding pseudo-labeled data $(x^{(t)}, y^{(t)})$ given the State at step $t$. We take the relation that corresponds to the maximum probability after softmax as the pseudo label:
$$y^{(t)} = \arg\max f_\theta(x^{(t)}, E1, E2) \quad (1)$$
Reward: We use the reward to signal labeling biases of the current policy on pseudo-labeled data.
Our goal is to minimize the approximation error of the gradients obtained over the pseudo-labeled data. In other words, we maximize the correlation between the gradients over the pseudo-labeled data and those over the labeled data. We define the standard gradient descent direction on all $N$ labeled data as $g_l$ and the expected gradient descent direction on the pseudo-labeled data as $g_p$, respectively:
$$g_l = \nabla_\theta L_l(\theta), \qquad g_p = \nabla_\theta L_p(\theta)$$
where $\nabla_\theta$ refers to the partial derivatives of the cross entropy loss $L$ of the Policy $f_\theta$ with respect to $\theta$. Considering that outliers in the labeled data will affect the direction of standard gradient descent, we approximate $g_l$ by averaging over all $N$ labeled data, and we define $L_l$ and $L_p$ as:
$$L_l = \frac{1}{N} \sum_{n=1}^{N} \mathrm{loss}\big(f_\theta(x^{(n,E1,E2)}), \mathrm{one\_hot}(y^{(n)})\big), \qquad L_p = \mathrm{loss}\big(f_\theta(x^{(t,E1,E2)}), \mathrm{one\_hot}(y^{(t)})\big)$$
where loss is the cross entropy loss function, $f_\theta(x^{(n,E1,E2)})$ returns a probability distribution over all relation categories for the $n$-th sample, and $\mathrm{one\_hot}(y^{(n)})$ returns a one-hot vector indicating the target label assignment.
Since the most important guidance provided by the gradient vector $g_l$ is its descent direction, we measure the discrepancy between $g_l$ and $g_p$ for state $s^{(t)}$ by defining their cosine similarity as the reward:
$$R^{(t)} = \frac{g_l \cdot g_p}{\lVert g_l \rVert \, \lVert g_p \rVert}$$
The range of $R^{(t)}$ is $[-1, 1]$. For pseudo-labeled data with a high reward, $R^{(t)} > \lambda$, we treat them as positive reinforcement to improve the generalization ability of the RLG network. We add these selected pseudo-labeled data to the labeled data and correct the standard gradient descent direction:
$$g_l \leftarrow \frac{N}{N+1}\, g_l + \frac{1}{N+1}\, g_p \quad (8)$$
For Eq. (8), we set the weights of the updated gradient direction according to the number of samples, where the standard gradient direction is calculated using all $N$ labeled samples and each pseudo-labeled sample. The positive feedback obtained from GIRL via trial and error contributes to improving the RLG network (Policy), so that it assigns correct pseudo labels to the next unlabeled data $x^{(t)}$ (State).
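As a small numerical illustration of the reward and the gradient correction described above, the sketch below computes the cosine similarity between two flattened gradient vectors and then folds a pseudo-labeled gradient into the standard direction. The function names are hypothetical, gradients are plain Python lists rather than model tensors, and the sample-count weighting $(N \cdot g_l + g_p)/(N+1)$ is an assumption based on the statement that the weights follow the number of samples.

```python
import math
from typing import List


def cosine_reward(g_l: List[float], g_p: List[float]) -> float:
    """Reward R in [-1, 1]: cosine similarity of labeled and pseudo-labeled gradients."""
    dot = sum(a * b for a, b in zip(g_l, g_p))
    norm = math.sqrt(sum(a * a for a in g_l)) * math.sqrt(sum(b * b for b in g_p))
    # A zero gradient carries no direction; treat it as a neutral reward.
    return dot / norm if norm > 0 else 0.0


def update_standard_gradient(g_l: List[float], g_p: List[float], n: int) -> List[float]:
    """Assumed sample-count weighting: g_l averages n samples, g_p adds one more."""
    return [(n * a + b) / (n + 1) for a, b in zip(g_l, g_p)]


g_labeled = [1.0, 0.0]
aligned = cosine_reward(g_labeled, [2.0, 0.0])      # same direction
orthogonal = cosine_reward(g_labeled, [0.0, 3.0])   # unrelated direction
opposed = cosine_reward(g_labeled, [-1.0, 0.0])     # opposite direction
corrected = update_standard_gradient([1.0, 0.0], [0.0, 1.0], n=3)
print(aligned, orthogonal, opposed, corrected)
```

Because only the direction matters for the reward, the magnitude of the pseudo-labeled gradient does not change it; the magnitude does matter in the weighted update.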

Reinforcement Learning Loss
We adopt the REINFORCE algorithm (Williams, 1992) with policy gradient for optimization. We calculate the loss over a batch of pseudo-labeled samples. The RLG is optimized by GIRL on each batch according to the following reinforcement learning loss:
$$L(\theta) = \sum_{t=1}^{T} R^{(t)} \cdot \mathrm{loss}\big(f_\theta(x^{(t,E1,E2)}), \mathrm{one\_hot}(y^{(t)})\big)$$
where loss is the cross entropy loss function, $R^{(t)}$ is the reward and $y^{(t)} \sim \pi(\cdot \mid x^{(t,E1,E2)}; \theta)$. The function $\pi$ denotes the Policy in reinforcement learning. In our setting, it is parameterized as $f_\theta$, which is learned to assign pseudo labels to unlabeled data, and we minimize $L(\theta)$ to optimize $\theta$. $T$ represents the total number of time steps in a reinforcement learning episode and is set to 16, the same number as the batch size. Each pseudo-labeled sample with a high reward, $R^{(t)} > \lambda$ with $\lambda = 0.5$, is used to dynamically update the labeled dataset and the standard gradient direction, and to guide the reinforcement learning process to the next State.
Note that $f_\theta$ is first pretrained on all the labeled data in a supervised way. While the reinforcement learning loss is calculated, our model follows a Markov decision process: the labeled data $D_l$ and the standard gradient descent direction $g_l$ are dynamically corrected by the selected pseudo-labeled data $D_p$, which means that for each State, the Policy is updated over time $t$. The RLG can thus solicit positive feedback from GIRL via trial and error.
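The batch-level computation can be sketched as below, under stated assumptions: the policy outputs are given directly as probability lists (standing in for $f_\theta$), the action is the arg-max label as in the Action definition, and the loss is taken to be the reward-weighted sum of cross-entropy terms, which is the standard REINFORCE surrogate when the loss is minimized. The function names are hypothetical and no gradient step is performed here.

```python
import math
from typing import List

LAMBDA = 0.5  # reward threshold stated in the text


def pseudo_label(probs: List[float]) -> int:
    """Action: index of the most probable relation."""
    return max(range(len(probs)), key=probs.__getitem__)


def reinforce_loss(batch_probs: List[List[float]], rewards: List[float]) -> float:
    """Assumed surrogate: sum_t R_t * cross_entropy(probs_t, pseudo_label_t)."""
    total = 0.0
    for probs, reward in zip(batch_probs, rewards):
        y = pseudo_label(probs)
        total += reward * -math.log(probs[y])  # cross entropy with a one-hot target
    return total


def select_high_reward(rewards: List[float], threshold: float = LAMBDA) -> List[int]:
    """Indices of pseudo-labeled samples that would be added to the labeled set."""
    return [t for t, r in enumerate(rewards) if r > threshold]


batch = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
rewards = [1.0, 0.0, 0.6]
loss_value = reinforce_loss(batch, rewards)
selected = select_high_reward(rewards)
print(loss_value, selected)
```

A sample with zero reward contributes nothing to the loss, and a sample with negative reward pushes probability away from its pseudo label, which is how the reward acts as explicit feedback.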

Contextualized Data Augmentation
Besides the typical LRE scenario where both limited labeled data and large amounts of unlabeled data are available, GradLRE additionally handles an extreme yet practical LRE scenario where only limited labeled data is available. As shown by the orange arrow in Figure 2, we propose a contextualized augmentation method, namely CDA, to generate more unlabeled data.
Given a sentence $x$ in the labeled data where two entities E1 and E2 are mentioned, CDA samples spans of the sentence and replaces them with [MASK] until the masking budget has been spent (e.g., 15% of $x$), and finally fills the masks with tokens using the pretrained language model. Inspired by Joshi et al. (2020), we sample a span length $\ell$ from a geometric distribution $\ell \sim \mathrm{Geo}(p)$, where $\ell \in [1, 10]$. The parameter $p$ affects the probability of selecting different span lengths: a larger $p$ leads to shorter spans. We follow Joshi et al. (2020) and choose $p = 0.2$. $\mathrm{Geo}(0.2)$ yields a mean span length of $\mathrm{mean}(\ell) = 3.8$, and shorter spans are more likely to be chosen. We never mask E1 and E2, and we require the starting point of a span to be the beginning of a word, which ensures that complete words are masked.
For example, we may mask the words DELIVERED TO in "A letter was delivered to my office in this morning." and obtain the augmented sentence "A letter was sent from my office in this morning.". Compared with the original labeled data, the augmented sentence may have a different relation label. We therefore use RLG, which has strong discriminative power, to assign a correct label to the augmented unlabeled sentence. Since "no relation" is defined as a valid relation category in the dataset, RLG can safely label an augmented sentence as "no relation" when it is out of scope.
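The span-sampling part of CDA can be sketched as follows. This is an illustrative approximation rather than the authors' code: span lengths follow a geometric distribution truncated to [1, 10] and renormalized (an assumption that reproduces the stated mean of about 3.8), masking operates on whole words, entity words are protected, and the mask-filling step with a pretrained language model is left out. All function names are hypothetical.

```python
import random
from typing import List, Optional, Set


def span_length_probs(p: float = 0.2, max_len: int = 10) -> List[float]:
    """Geometric distribution truncated to 1..max_len and renormalized (assumption)."""
    weights = [p * (1 - p) ** (k - 1) for k in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_span_length(p: float = 0.2, max_len: int = 10) -> float:
    return sum(k * q for k, q in enumerate(span_length_probs(p, max_len), start=1))


def mask_spans(words: List[str], protected: Set[int], budget: float = 0.15,
               p: float = 0.2, max_len: int = 10,
               rng: Optional[random.Random] = None) -> List[str]:
    """Replace whole-word spans with [MASK] until the budget is spent.

    `protected` holds the word indices of E1 and E2, which are never masked.
    """
    if rng is None:
        rng = random.Random(0)
    probs = span_length_probs(p, max_len)
    candidates = [i for i in range(len(words)) if i not in protected]
    n_to_mask = min(max(1, round(budget * len(words))), len(candidates))
    masked: Set[int] = set()
    while len(masked) < n_to_mask:
        length = rng.choices(range(1, max_len + 1), weights=probs)[0]
        start = rng.choice([i for i in candidates if i not in masked])
        for i in range(start, min(start + length, len(words))):
            # Stop the span at an entity word or once the budget is spent.
            if i in protected or len(masked) >= n_to_mask:
                break
            masked.add(i)
    return ["[MASK]" if i in masked else w for i, w in enumerate(words)]


words = "A letter was delivered to my office in this morning .".split()
augmented = mask_spans(words, protected={1, 6})
print(round(mean_span_length(), 2), " ".join(augmented))
```

The masked sentence would then be passed to a masked language model to fill in replacement words, and RLG would assign the relation label to the result.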

Experiments
We conduct extensive experiments on two datasets to demonstrate the effectiveness of our Gradient Imitation Reinforcement Learning for low-resource relation extraction tasks, and give a detailed analysis of each module to show the advantages of GradLRE.

Datasets
We follow Hu et al. (2021) and conduct experiments on two public RE datasets: SemEval 2010 Task 8 (SemEval) (Hendrickx et al., 2010) and the TAC Relation Extraction Dataset (TACRED). SemEval is a standard benchmark dataset for evaluating relation extraction models. It consists of training, validation and test sets with 7199, 800 and 1864 relation mentions respectively, with 19 relation types in total (including no_relation), where no_relation accounts for 17.4% of the mentions. TACRED is a large-scale crowd-sourced relation extraction dataset collected from all the prior TAC KBP relation schemas. It consists of training, validation and test sets with 75049, 25763 and 18659 relation mentions respectively, with 42 relation types in total (including no_relation), where no_relation accounts for 78.7% of the mentions.

Baselines and Evaluation metrics
GradLRE is flexible to integrate different contextualized encoders. From Table 1 ... For baselines, we compare GradLRE with six other representative methods:
(1) Self-Training (Rosenberg et al., 2005) iteratively improves the model by predicting pseudo labels for unlabeled data and adding the pseudo-labeled data to the labeled data.
(2) Mean-Teacher (Tarvainen and Valpola, 2017) is jointly optimized by a perturbation-based loss and a training loss to ensure that the model makes consistent predictions on similar data.
(3) DualRE (Lin et al., 2019) treats relation extraction as a dual task from relations to sentences and combines the loss of a prediction module and a sentence retrieval module. The difference between the Pairwise and Pointwise schemes lies in whether the retrieved documents are given scores or a relative order.
(4) RE-Ensemble (Lin et al., 2019) replaces the retrieval module in the DualRE framework with the same prediction module.
(5) MRefG (Li and Qian, 2020) semantically connects the unlabeled data to the labeled data by constructing reference graphs, including entity reference, verb reference and semantics reference.
(6) MetaSRE (Hu et al., 2021) is the state-of-the-art method, which generates pseudo labels on unlabeled data by meta-learning from the successful and failed attempts of the classification module as an additional meta-objective.
Finally, we present another model: BERT w. gold labels, which indicates the upper bound of LRE models when all unlabeled data has gold labels during training with labeled data.
For the evaluation metrics, we choose F1 score as the main metric. Note that following Hu et al. (2021), the correct predictions of no_relation are ignored.
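A minimal sketch of an F1 computation that ignores correct no_relation predictions is given below. It assumes micro-averaging over the positive relations, which is the common convention for TACRED-style scoring; the text does not state the averaging scheme, so treat that choice and the function name as assumptions.

```python
from typing import List


def f1_ignoring_no_relation(gold: List[str], pred: List[str],
                            negative: str = "no_relation") -> float:
    """Micro F1 over positive relations; matches on `negative` earn no credit."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g != negative)
    predicted = sum(1 for p in pred if p != negative)  # predicted positives
    actual = sum(1 for g in gold if g != negative)     # gold positives
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["A", "B", "no_relation", "A"]
pred = ["A", "no_relation", "no_relation", "B"]
score = f1_ignoring_no_relation(gold, pred)
print(score)
```

In this example one of two predicted positives is correct (precision 0.5) and one of three gold positives is recovered (recall 1/3), so the correctly predicted no_relation instance does not raise the score.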

Implementation Details
For the two datasets, strictly following the settings used in Hu et al. (2021), we use stratified sampling to divide the training set into labeled and unlabeled datasets of various proportions, which ensures that all subsets share the same relation label distribution. For SemEval, we sample 5%, 10% and 30% of the training set as labeled datasets; for TACRED, we sample 3%, 10% and 15%. For both datasets, we sample 50% of the training set as the unlabeled dataset. As suggested in Hu et al. (2021), we split all unlabeled data into 10 segments. In each iteration, RLG is optimized on one segment of the data. The RLG gradually improves as we obtain more high-quality pseudo labels one iteration after another. We implement this strategy for our model and the baselines.
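The data split described above can be sketched as follows. This is a simplified illustration with hypothetical function names: it samples the labeled and unlabeled subsets per relation label so that both keep the label distribution, and then divides the unlabeled indices into 10 segments. How the original experiments round per-class counts or order the segments is not stated, so those details are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], labeled_frac: float,
                     unlabeled_frac: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return disjoint (labeled, unlabeled) index lists sampled per relation label."""
    rng = random.Random(seed)
    by_label: Dict[str, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    labeled: List[int] = []
    unlabeled: List[int] = []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_l = round(labeled_frac * len(idxs))
        n_u = round(unlabeled_frac * len(idxs))
        labeled.extend(idxs[:n_l])
        unlabeled.extend(idxs[n_l:n_l + n_u])  # gold labels are discarded downstream
    return labeled, unlabeled


def split_segments(items: List[int], k: int = 10) -> List[List[int]]:
    """Divide the unlabeled pool into k roughly equal segments, one per iteration."""
    return [items[i::k] for i in range(k)]


toy_labels = ["A"] * 100 + ["B"] * 50
labeled_idx, unlabeled_idx = stratified_split(toy_labels, labeled_frac=0.1, unlabeled_frac=0.5)
segments = split_segments(unlabeled_idx, k=10)
print(len(labeled_idx), len(unlabeled_idx), [len(s) for s in segments])
```

With a 10% labeled and 50% unlabeled split, the toy data keeps the 2:1 ratio between labels A and B in both subsets.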
For RLG, we use the default BERT tokenizer with a maximum length of 128 to preprocess data. We use pretrained BERT-Base_Cased as the initial parameters to encode contextualized entity-level representations. The fully connected network is defined with layer dimensions of $2h_R$-$h_R$-label_size, where $h_R = 768$. We use BertAdam with a learning rate of 1e−4 and a warmup of 0.1 to optimize the loss. For GIRL, the total time step $T$ is set to 16, the same number as the batch size. We use AdamW (Loshchilov and Hutter, 2018) with a learning rate of 5e−5 to optimize the reinforcement learning loss.

Main Results
Table 1 shows the mean and standard deviation of F1 over 5 runs of training and testing on SemEval and TACRED when leveraging various amounts of labeled data and 50% unlabeled data. All methods gain performance improvements from the unlabeled data when compared with the model that only uses labeled data (BERT), which demonstrates the effectiveness of unlabeled data in the LRE setting. We observe that GradLRE outperforms all baseline models consistently. More specifically, compared with the previous SOTA model MetaSRE, GradLRE on average achieves 1.21% higher F1 on SemEval and 1.15% higher F1 on TACRED across the various amounts of labeled data. When considering standard deviation, GradLRE is also more robust than all the baselines.
Considering LRE when labeled data is very scarce, e.g., 5% for SemEval and 3% for TACRED, GradLRE achieves an average 1.27% F1 boost compared with MetaSRE. When more labeled data is available, 30% for SemEval and 15% for TACRED, the average F1 improvement is consistent, but reduced to 0.85%. We attribute the consistent improvement of GradLRE to the explicit feedback that GIRL provides and to learning via trial and error: we use Gradient Imitation as a proxy for the classification loss in optimizing RLG. The guidance from the gradient direction, as a part of the gradient imitation process, is more instructive, explicit, and generalizable than the implicit signals from training directly on labeled data.
We further vary the ratio of unlabeled data and report performance in Figure 3. We report F1 with a fixed 10% labeled data and 10%, 30%, 50%, 70% and 90% unlabeled data. Note that both labeled data and unlabeled data come from the training set, so the amount of unlabeled data has an upper limit of 90%. Almost all methods gain performance with the addition of unlabeled data, and GradLRE achieves consistently better F1, by a clear margin, than the baselines under all ratios of unlabeled data.

Effectiveness of Gradient Imitation Reinforcement Learning
The main purpose of GIRL is to guide RLG to generate pseudo labels on the unlabeled data whose optimization outcomes are similar to those of labeled data. GIRL minimizes the discrepancy between the gradient vectors obtained from the labeled data and the generated data. To demonstrate the effectiveness of Gradient Imitation Reinforcement Learning, we first conduct an ablation study. GradLRE w/o Gradient Imitation Reinforcement Learning is essentially the same as the Self-Training BERT baseline, which iteratively updates the model with the synthetic set containing labeled data and generated data without Gradient Imitation Reinforcement Learning. From Table 1, we observe that GradLRE w/o Gradient Imitation Reinforcement Learning (Self-Training BERT) loses 5.38% F1, averaged over all amounts of labeled data on the two datasets.
We find that the performance gains of GradLRE come from the improved pseudo label quality obtained by adopting GIRL. To validate this, we draw a box plot of the pseudo label F1. From Figure 4, we find that for the two datasets with different ratios of labeled data, GIRL undoubtedly improves the F1 of pseudo labels. In the case of 30% SemEval and 15% TACRED, where labeled data is less scarce, GIRL can obtain more accurate gradient directions based on a larger set of labeled data. As a result, the improvements in pseudo label performance are more significant.
More specifically, we show the gradient descent direction of GradLRE on labeled data and pseudo-labeled data in Figure 5. Considering the very large number of parameters in RLG, we use Principal Component Analysis (Wold et al., 1987) to reduce the dimension of the parameters to 2, and reflect the direction of gradient descent through the update of the parameters. Although the optimization direction on pseudo-labeled data fluctuates at the beginning, GIRL gradually improves and ends up closer to the ideal local minima. When GIRL is not used, the optimization is appealing at first because of the initial positive gains from the self-training schema. However, the error-prone pseudo labels obtained without instructive feedback gradually push the optimization away from the local minima, which leads to reduced generalization ability.
We further study cases where pseudo labels are improved with GIRL on SemEval, and present them in Table 2. GradLRE w/o GIRL tends to predict the pseudo label Other, which has the most occurrences, most likely because Other is the dominant class in the dataset. GradLRE w. GIRL is less sensitive to the label distribution in the data and assigns correct labels. We also observe cases where GIRL is better at distinguishing the nuances between similar relations such as Content-Container and Component-Whole.

Handling various LRE scenarios
Considering both labeled and unlabeled data as the resource, we introduce the following LRE scenarios: 1) L+U: Limited labeled data and 50% unlabeled data. 2) L+CDA: Only limited labeled data is available and no unlabeled data is available; we leverage Contextualized Data Augmentation (CDA) to generate the same amount of data by augmenting the labeled data. 3) L: This is the baseline where the model is trained only on limited labeled data. We present results in Table 3.
Table 4: CDA on labeled data to obtain generated data, where red and blue represent head and tail entities respectively, and cyan represents the replaced words.
Compared to L, L+CDA achieves an average 4.01% improvement in F1, indicating the effectiveness of augmentation. We also observe that L+CDA obtains competitive performance compared with L+U on SemEval. On the more challenging TACRED dataset, L+CDA is only 2.07% lower in F1 than L+U, even though 6.36x fewer total samples are initially acquired.
We also vary the ratio of unlabeled data (accessible by L+U or augmented using L+CDA). From Figure 6, L+CDA outperforms L consistently. As the ratio of unlabeled data increases, L+CDA obtains more discriminative data and better performance: it achieves almost the same performance as L+U on SemEval. On TACRED, the performance difference is less than 1.53% across the various ratios of unlabeled data.
We show some sample data generated by CDA in Table 4. The BERT Masked Language Model generates replacement words based on the context information. We find that some sentences with replaced words still maintain the original relational information. Even when the semantic information of a part of the sentence has changed, the RLG is still able to classify the sentence into the most suitable relation.

Related Work
Relation Extraction aims to predict the binary relation between two entities in a sentence. Recent literature leverages deep neural networks to encode the features of the two entities from sentences, and then classifies these features into pre-defined relation categories. These methods achieve decent performance when sufficient labeled data is available (Zeng et al., 2015; Guo et al., 2020; Nan et al., 2020). However, it is labor-intensive to obtain large amounts of manual annotations on corpora. Low-resource Relation Extraction methods have gained a lot of attention recently (Levy et al., 2017; Tarvainen and Valpola, 2017; Lin et al., 2019; Li and Qian, 2020; Hu et al., 2021, 2020), since these methods require less labeled data and deep neural networks can expand limited labeled information by exploiting unlabeled data to iteratively improve performance. One major method is the self-training work proposed by Rosenberg et al. (2005). Self-training incrementally assigns pseudo labels to unlabeled data and leverages these pseudo labels to iteratively improve the classification capability of the model. However, these methods suffer from the gradual drift problem (Curran et al., 2007; Zhang et al., 2016; Arazo et al., 2019; Han et al., 2018; Jiang et al., 2018): during training, the generated pseudo-labeled data contains noise that cannot be corrected by the model itself. Using this pseudo-labeled data iteratively causes the model to deviate from the global minima. Our work alleviates this problem by encouraging pseudo-labeled data to imitate the gradient optimization direction on the labeled data, and by introducing an effective feedback loop to improve generalization ability via reinforcement learning.
Reinforcement Learning is widely used in Natural Language Processing (Narasimhan et al., 2016; Li et al., 2016; Su et al., 2016; Yu et al., 2017; Takanobu et al., 2019). These methods are all designed with rewards that force the correct actions to be executed during model training, so as to improve model performance. Zeng et al. (2019) apply a policy gradient method to model future reward in a joint entity and relation extraction task. In our work, we define the reward as the cosine similarity between the gradient vectors calculated from pseudo-labeled data and labeled data.
Data augmentation methods are leveraged in natural language processing to improve the generalization ability of the model by generating discriminative samples (Kobayashi, 2018; Dai and Adel, 2020). Gao et al. (2019) contextually augment data by replacing the one-hot representation of a word with a distribution over the vocabulary provided by BERT. However, they only consider the replacement of a single word, which limits the capability to expand the sentence semantics (Joshi et al., 2020). In our work, we use [MASK] to replace a span of words and leverage the BERT Masked Language Modeling task to fill the [MASK].

Conclusion
In this paper, we propose GradLRE, a reinforcement learning framework for low-resource RE. Different from conventional self-training models, which suffer from gradual drift when generating pseudo labels, our model encourages pseudo-labeled data to imitate the gradient optimization direction on labeled data to improve the pseudo label quality. We find that our learning paradigm gives more instructive, explicit, and generalizable signals than the implicit signals obtained by training the model directly on labeled data. Contextualized data augmentation is proposed to handle the extremely low-resource RE situation where no unlabeled data is available. Experiments on two public datasets show the effectiveness of GradLRE and of the augmented data over competitive baselines.