Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study

This paper presents an empirical study to build relation extraction systems in low-resource settings. Based upon recent pre-trained language models, we comprehensively investigate three schemes to evaluate the performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; (iii) data augmentation technologies and self-training to generate more labeled in-domain data. We create a benchmark with 8 relation extraction (RE) datasets covering different languages, domains and contexts and perform extensive comparisons over the proposed schemes with combinations. Our experiments illustrate: (i) Though prompt-based tuning is beneficial in low-resource RE, there is still much potential for improvement, especially in extracting relations from cross-sentence contexts with multiple relational triples; (ii) Balancing methods are not always helpful for RE with long-tailed distribution; (iii) Data augmentation complements existing baselines and can bring much performance gain, while self-training may not consistently achieve advancement to low-resource RE. Code and datasets are in https://github.com/zjunlp/LREBench.


Introduction
Relation Extraction (RE) aims to extract relational facts from text and plays an essential role in information extraction. The success of neural networks for RE has been witnessed in recent years; however, open issues remain, as these models still depend on large amounts of labeled data in practice. For example, Han et al. (2018) found that model performance drops dramatically as the number of instances for one relation decreases, e.g., for long-tail relations. An extreme scenario is few-shot RE.

* Equal contribution and shared co-first authorship.
Many efforts are devoted to improving generalization beyond learning directly from limited labeled data. Early on, Mintz et al. (2009) proposed distant supervision for RE, which leverages facts in a knowledge graph (KG) as weak supervision to obtain annotated instances. Rosenberg et al. (2005); Liu et al. (2021a); Hu et al. (2021) assign pseudo labels to unlabeled data and leverage both pseudo-labeled and gold-labeled data to iteratively improve the generalization capability of models. Some studies apply meta-learning strategies to endow a model with the ability to optimize rapidly, or leverage transfer learning to alleviate the data-hungry issue (Yu et al., 2020b; Li et al., 2020a). Other studies focus on the long-tailed class distribution, especially tail classes that only allow learning from a few instances. With the prosperity of pre-trained language models (PLMs), the pre-train/fine-tune paradigm has become standard in natural language processing (NLP), leading to a tremendous increase in low-resource RE (LRE) performance. More recently, a new methodology named prompt learning has made waves in the community by demonstrating astounding few-shot capabilities on LRE (Chen et al., 2022d).
In this work, we benchmark more realistic scenarios on diverse datasets for low-resource RE, in which models have to handle both extreme few-shot instances and long-tailed distribution, and can also make use of data augmentation or unlabeled in-domain data without cross-validation (Perez et al., 2021). These settings are appealing because: (i) such models mirror deployment in applied settings; (ii) few-shot settings with long-tailed distribution are realistic; (iii) diverse datasets cover different languages (Chinese and English), domains (general, scientific), and contexts (one or more sentences with single or multiple relational triples).
Specifically, we focus on improving the generalization ability from three directions shown in Figure 1. Instead of using limited few-shot data, we create different types of prompts for RE and empirically analyze low-resource performance. We further implement many popular balancing methods for long-tailed distribution, which can mitigate performance decay in instance-scarce (tail) classes. We also leverage more generated training instances by data augmentation and self-training in conjunction with the limited labeled data.
Our contributions include: (i) We present the first systematic study for low-resource RE, an important problem in information extraction, by investigating three distinctive schemes with combinations. (ii) We conduct extensive comparisons with in-depth analysis on 8 RE datasets and report empirical results with insightful findings. (iii) We release both the data and the source code of these baselines as an open-sourced testbed for future research purposes.
To shed light on future research on low-resource RE, our empirical analysis suggests that: (i) Previous state-of-the-art methods in the low-resource setting still struggle to match fully-supervised performance (cross-sentence LRE is extremely challenging), which indicates that there is still much room for improvement in low-resource RE.
(ii) Balancing methods may not always benefit low-resource RE. The long-tailed issue cannot be ignored, and more effort should be devoted to model development. (iii) With some simple data augmentation methods, better performance can be achieved, highlighting opportunities for future improvements on low-resource RE.
Background on Low-resource RE

Low-resource RE

RE is a classification task that aims to assign relation labels to entity pairs in given contexts. Formally, in an RE dataset denoted as D = {X, Y}, X is the set of texts and Y is the set of relation labels. Given a text x = {w_1, w_2, ..., w_s, ..., w_o, ..., w_|x|}, where x ∈ X, RE aims to predict the semantic relation y_x ∈ Y holding between the subject entity w_s and the object entity w_o. Conventional RE systems are trained in the standard supervised learning regime, where large amounts of labeled examples are required. Nevertheless, owing to varied languages and domains and the cost of human annotation, only a very small number of labeled examples is commonly available in real-world applications. Thus, traditional supervised learning with few-shot labeled data struggles to achieve satisfactory performance (Schick and Schütze, 2021). Consequently, a challenging task, low-resource RE, has emerged.

Fine-tuning PLMs for RE
A typical baseline method for RE is to fine-tune a PLM M, as shown in Figure 2. A [CLS] head then computes the probability distribution over the class set Y with the softmax p(·|x) = Softmax(W h_[CLS] + b), where W is a learnable weight matrix randomly initialized at the start of fine-tuning, h_[CLS] is the hidden vector of [CLS], and b is the learnable bias. All learnable parameters are fine-tuned by minimizing the cross-entropy loss over p(y_x|x) on D. Nevertheless, conventional supervised fine-tuning may over-fit a few training examples and generalize poorly on test sets in the low-resource RE task.
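The [CLS]-head computation above can be sketched in a few lines of pure Python. This is an illustrative toy (made-up dimensions and random values stand in for a real PLM's hidden state), not the actual implementation:

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cls_head_forward(h_cls, W, b):
    # p(.|x) = Softmax(W h_[CLS] + b): one logit per relation class.
    logits = [sum(w_i * h_i for w_i, h_i in zip(row, h_cls)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(logits)

def cross_entropy(probs, gold):
    # Negative log-likelihood of the gold relation label.
    return -math.log(probs[gold])

random.seed(0)
hidden, num_rel = 4, 3
h = [random.gauss(0, 1) for _ in range(hidden)]   # stand-in for h_[CLS]
W = [[random.gauss(0, 0.1) for _ in range(hidden)] for _ in range(num_rel)]
b = [0.0] * num_rel                               # randomly initialized head
p = cls_head_forward(h, W, b)
loss = cross_entropy(p, gold=1)
print(p, loss)
```

The probabilities sum to one over the relation set Y, and the cross-entropy loss is what the fine-tuning step minimizes over D.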

Methods for Low-resource RE
In this paper, we conduct a comprehensive empirical study with three distinctive schemes against the difficulties of low-resource RE: prompt-based tuning of PLMs, balancing long-tailed data, and leveraging more instances, as shown in Figure 2.

Prompting for Few-shot Instances
To address the low-resource issue of data sparsity for RE, we first analyze prompting methods. Unlike standard fine-tuning, prompt-based tuning reformulates classification tasks as cloze-style language modeling problems and predicts answer words, denoted as V, through the masked language model (MLM) head. Specifically, a template T_prompt converts every instance x into a prompt input x_prompt = T_prompt(x), in which there is at least one [MASK] for M to fill with the right answer word v ∈ V. Meanwhile, a verbalizer connects relation labels with answer words via an injective mapping γ : Y → V. With the aforementioned functions, we can formalize the probability distribution over Y with the probability distribution over V at the masked position (Ma et al., 2021):

p(y_x|x) = p([MASK] = γ(y_x) | x_prompt) = Softmax(W_lm h_[MASK]),

where W_lm is the set of parameters of the PLM head. Note that the main difference among various prompt-based tuning methods lies in the design of the prompt template and the verbalizer. Thus, we benchmark different kinds of prompting methods in low-resource RE to empirically investigate their performance. For the prompt template, given the input x, the first choice is to manually design the template. We utilize natural language or the task schema to formulate different prompt templates. Formally, we have:

x_prompt = T_prompt(x) = "x <sub> [MASK] <obj>",

where <sub> is the head entity mention and <obj> is the tail entity mention. Since there exists rich semantic knowledge within relation labels and structural knowledge among relational triples, we also benchmark previous studies such as PTR and KnowPrompt (Chen et al., 2022d), which incorporate relational knowledge into prompt-based tuning, as shown in Figure 2(b).
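A minimal sketch of a manual template and verbalizer follows. The relation labels and answer words here are invented for illustration and are not the paper's exact mapping:

```python
def template(text, sub, obj):
    # Manual prompt: append a cloze sentence containing one [MASK] slot
    # that the MLM head must fill with an answer word.
    return f"{text} {sub} [MASK] {obj}."

# Injective verbalizer gamma: Y -> V (toy labels/answer words).
VERBALIZER = {
    "org:founded_by": "founded",
    "per:employee_of": "works",
    "no_relation": "irrelevant",
}

def label_from_answer(answer_word):
    # Invert gamma to map a predicted answer word back to a relation label.
    inverse = {v: k for k, v in VERBALIZER.items()}
    return inverse[answer_word]

x_prompt = template("Steve Jobs started Apple in 1976.", "Apple", "Steve Jobs")
print(x_prompt)
print(label_from_answer("founded"))
```

In a real system, the PLM scores each answer word at the [MASK] position, and the verbalizer turns the highest-scoring word into the predicted relation.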

Balancing for Long-tailed Distribution
Learning with long-tailed data, where the number of instances per class varies greatly, is a common challenge in low-resource RE because instance-rich (head) classes dominate the training procedure. Note that the learnable parameters of the trained model tend to perform better on these head classes and worse on less frequent (tail) classes (Kang et al., 2020a). To address this issue, we explore two balancing methods for low-resource RE: re-sampling data and re-weighting losses.
Re-sampling Data We re-sample RE datasets to balance the data distribution. For example, tail classes can be over-sampled by adding copies of data, and head classes can be under-sampled by removing data, as shown in Figure 2(c). Specifically, we use a toolkit that estimates the sampling weights automatically when sampling from imbalanced data, to obtain datasets with a nearly balanced distribution.
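The core idea of automatic weight estimation can be sketched with inverse-frequency weights (a simplified stand-in for the toolkit's behavior, not its actual code):

```python
from collections import Counter
import random

def balanced_sample(labels, k, seed=0):
    # Weight each instance by the inverse frequency of its class, then draw
    # k indices with replacement: tail classes get over-sampled and head
    # classes under-sampled, approximating a balanced distribution.
    freq = Counter(labels)
    weights = [1.0 / freq[y] for y in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=k)

labels = ["head"] * 90 + ["tail"] * 10          # long-tailed toy dataset
idx = balanced_sample(labels, k=1000)
resampled = Counter(labels[i] for i in idx)
print(resampled)  # roughly 500/500 rather than 900/100
```

Because each class's total weight is equal (90 × 1/90 = 10 × 1/10), both classes are drawn with probability 0.5 regardless of their original sizes.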

Re-weighting Loss
We utilize various re-weighting losses, assigning different weights to training instances from each class. For instance, DSC Loss (Li et al., 2020b) attaches similar importance to false positives and false negatives. Focal Loss (Lin et al., 2020a) balances the sample-wise classification loss for model training by down-weighting easy samples. GHM Loss (Li et al., 2019a) applies a gradient harmonizing mechanism, making the model ignore outliers to conquer the disharmony in classification. LDAM Loss (Cao et al., 2019) expands the decision boundaries of few-shot classes.
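As one concrete example, the focal loss for the gold class can be written in a few lines (a standard textbook form of Lin et al.'s loss, not our training code):

```python
import math

def focal_loss(p_gold, gamma=2.0):
    # Focal loss down-weights easy examples: the (1 - p)^gamma factor shrinks
    # the loss when the gold-class probability p is already high. With
    # gamma = 0 it reduces to plain cross-entropy.
    return -((1.0 - p_gold) ** gamma) * math.log(p_gold)

easy, hard = 0.95, 0.30
print(focal_loss(easy), focal_loss(hard))
# Ratio to plain cross-entropy shows the easy example is suppressed far more:
print(focal_loss(easy) / -math.log(easy), focal_loss(hard) / -math.log(hard))
```

Easy samples (typically from head classes) thus contribute much less gradient, letting rarer, harder samples drive training.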

Leveraging More Instances via Data Augmentation and Self-training
It is also beneficial to leverage more instances to address the low-resource issue. We conduct data augmentation and also leverage unlabeled in-domain data via self-training, as shown in Figure 2(d).
Data augmentation (DA) automatically generates more labeled instances based on only a few labeled instances. For example, we utilize token-level augmentation, which changes or inserts words and phrases in a sentence to generate augmented text that retains the same label as the original text. In this work, we apply three DA methods for English RE datasets to substitute words in training sets based on WordNet synonyms, TF-IDF similarity, and contextual word embeddings, implemented with nlpaug. We replace words with their synonyms via nlpcda to augment Chinese RE samples. We further analyze different types of augmentation objects in RE: contexts, entities, and both.
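Token-level synonym substitution can be sketched as follows. The synonym table here is a toy stand-in; the actual substitutions in our experiments come from WordNet via nlpaug (English) and nlpcda (Chinese):

```python
import random

# Toy synonym table standing in for WordNet lookups.
SYNONYMS = {"car": ["automobile"], "made": ["produced"], "big": ["large"]}

def synonym_augment(tokens, p=0.5, seed=0):
    # Replace each token that has a synonym with probability p. The relation
    # label of the instance is unchanged, so the augmented sentence is a new
    # labeled training example.
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out

sent = "the factory made a big car".split()
print(synonym_augment(sent, p=1.0))
```

Restricting which tokens are eligible (only context words, only entity mentions, or both) gives the three augmentation objects analyzed above.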
Since substantial, easily-collected unlabeled data can also be leveraged for low-resource RE, we conduct self-training, a classical, intuitive, and straightforward semi-supervised learning method. Specifically, we train a model on labeled data and then expand the labeled set with the most confident predictions (a.k.a. pseudo labels) on unlabeled data. We combine the data with gold and pseudo labels to obtain the final RE model. The whole self-training pipeline is described in Appendix A.5.
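The teacher/student loop can be sketched as below. The "model" is a deliberately trivial majority-label classifier with a fixed confidence, standing in for a real PLM with softmax scores:

```python
from collections import Counter

def train_fn(data):
    # Toy "training": remember the majority label (stands in for fitting a PLM).
    return Counter(y for _, y in data).most_common(1)[0][0]

def predict_fn(model, x):
    # Toy prediction with a fixed confidence; a real model returns softmax scores.
    return model, 0.95

def self_train(labeled, unlabeled, threshold=0.9):
    teacher = train_fn(labeled)                 # 1. teacher on gold-labeled data
    pseudo = []
    for x in unlabeled:                         # 2. keep confident pseudo labels
        y, conf = predict_fn(teacher, x)
        if conf >= threshold:
            pseudo.append((x, y))
    return train_fn(labeled + pseudo)           # 3. student on gold + pseudo data

labeled = [("a", "rel1"), ("b", "rel1"), ("c", "rel2")]
student = self_train(labeled, unlabeled=["d", "e"])
print(student)
```

The toy also hints at the failure mode in our results: the teacher's biases (here, predicting the majority class everywhere) propagate into the pseudo labels and reinforce themselves in the student.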

Benchmark Design
In this paper, we provide a comprehensive empirical study for low-resource RE and design the LREBench (Low-resource Relation Extraction Benchmark) to evaluate various methods. In the following section, we will detail the datasets chosen for experiments and the reproducibility of all baselines mentioned above.

Datasets Selection
As shown in Table 1, we select 8 RE datasets to evaluate baselines in low-resource settings, covering various domains: SemEval 2010 Task 8 (Hendrickx et al., 2009), TACREV, Wiki80, SciERC (Luan et al., 2018), ChemProt, DialogRE, and the Chinese datasets DuIE2.0 and CMeIE. Besides, the SciERC, ChemProt, DialogRE, and CMeIE datasets contain instances in which multiple sentences appear in one text; such cross-sentence RE is more challenging than single-sentence RE in SemEval, TACREV, and Wiki80.

Table 1: Statistics on the 8 public RE datasets selected for evaluation in LREBench. MS indicates whether a dataset contains instances with multiple sentences in one text, and MT indicates whether one text can be related to multiple relational triples. "*" means that we re-sample and convert Wiki80 into a long-tailed distribution through an exponential function, since its original distribution is exactly balanced. "cn" marks Chinese datasets.
For simplicity, we provide a unified input-output format for all datasets in the low-resource setting. Specifically, each instance in LREBench consists of one text and one relational triple (one head entity and one tail entity in the text and the corresponding relation between them). For datasets in which one text is related to multiple relational triples, such as ChemProt, SciERC, DialogRE, DuIE2.0 and CMeIE, we follow Zhong and Chen (2021) and split such a text into multiple instances, each with only one relational triple. In this way, we can utilize a unified input-output format for widespread models.
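The splitting step can be sketched as below. The field names are illustrative, not LREBench's exact schema:

```python
import json

def split_instances(text, triples):
    # One text with multiple (head, relation, tail) triples becomes multiple
    # single-triple instances sharing the same text, following the
    # per-instance format described above.
    return [{"text": text, "head": h, "tail": t, "relation": r}
            for (h, r, t) in triples]

text = "Aspirin inhibits COX-1 and COX-2."
triples = [("Aspirin", "inhibitor", "COX-1"),
           ("Aspirin", "inhibitor", "COX-2")]
for inst in split_instances(text, triples):
    print(json.dumps(inst))
```

Each resulting instance is a standard single-triple classification example, so the same models run unchanged across all 8 datasets.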
We conduct experiments in three settings with different proportions of training data to simulate different resource levels: 8-shot, 10% and 100%. For the 8-shot setting, we sample 8 instances for each relation category in the training and test sets. For the 10% and 100% settings, we sample 10 percent of the training set and use the whole training set, respectively. Since fine-tuning on small datasets can suffer from instability and results may change dramatically given a new split of data, we sample all training datasets 5 times randomly in the 8-shot and 10% settings and report the average performance. We follow the same sampling strategy in the re-sampling and data augmentation methods to obtain a fair comparison.
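The k-shot sampling procedure can be sketched as follows (a simplified illustration with invented field names, not the benchmark's sampling script):

```python
import random
from collections import defaultdict

def k_shot_sample(dataset, k=8, seed=0):
    # Group instances by relation label and draw k per relation with a fixed
    # seed. Repeating with several seeds yields the splits whose scores are
    # averaged to reduce variance from any single draw.
    rng = random.Random(seed)
    by_rel = defaultdict(list)
    for inst in dataset:
        by_rel[inst["relation"]].append(inst)
    split = []
    for rel, insts in by_rel.items():
        split.extend(rng.sample(insts, min(k, len(insts))))
    return split

data = [{"relation": r, "id": i} for i, r in
        enumerate(["rel_a"] * 20 + ["rel_b"] * 20)]
splits = [k_shot_sample(data, k=8, seed=s) for s in range(5)]
print([len(s) for s in splits])
```

With 2 relations and k = 8, each of the 5 seeded splits contains 16 instances.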

Reproducibility
Methods Throughout our experiments, we employ M = RoBERTa-large for SemEval, TACREV, Wiki80 and DialogRE, Chinese RoBERTa-large (Cui et al., 2020) for DuIE2.0 and CMeIE, and BioBERT-large (Lee et al., 2020) for ChemProt and SciERC, all from HuggingFace, as the backbone network (detailed in Appendix A.1). For each method, we investigate the following three schemes in different settings for the comparative empirical study, as shown in Table 2: (i) Normal is the general scheme with the PLM for low-resource relation extraction, evaluated in the 8-shot, 10% and 100% settings. (ii) Balance refers to the balancing methods in §3.2 for long-tailed data distribution, evaluated in the 10% and 100% settings. We list the best performance among all balancing methods for each dataset in Table 2 and detailed results in Table 3. (iii) Data augmentation (DA) methods are applied to the 10% training sets. We list the best performance among all DA methods in Table 2 and all results in Table 4. We also conduct self-training (ST), which first trains a teacher M on 10% of the training data and then tags the remaining 90% with pseudo labels produced by M. Both gold-labeled and pseudo-labeled data are used to obtain a final student RE model, as introduced in §3.3.

Training and Evaluation
We only train models on training sets, without validation on development sets, to ensure true few-shot learning with limited labeled data. For all training data sizes, we set the number of training epochs to 10. Except for the re-weighting losses for addressing the long-tailed problem, the cross-entropy loss is used in all training processes. Since the performance of head and tail classes varies a lot, we use both Macro F1 and Micro F1 as evaluation metrics. Implementation details can be found in Appendix A.

In Table 2, Balance represents balancing methods for long-tailed data, DA is data augmentation, and ST refers to self-training with unlabeled in-domain data. Results colored red mean prompt-based tuning works worse than fine-tuning between the two Normal columns; blue, orange, and purple results indicate that the performance of balancing methods, data augmentation and self-training, respectively, is poorer than the Normal method in the same setting.
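Why both metrics matter can be seen from a tiny worked example (plain-Python F1, mirroring the standard definitions rather than our evaluation code):

```python
from collections import Counter

def f1_scores(gold, pred, labels):
    # Micro F1 pools TP/FP/FN over all classes, so head classes dominate it;
    # Macro F1 averages per-class F1, so each tail class counts equally.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(labels)
    total_tp = sum(tp.values())
    micro_denom = 2 * total_tp + sum(fp.values()) + sum(fn.values())
    micro = 2 * total_tp / micro_denom if micro_denom else 0.0
    return macro, micro

gold = ["head"] * 9 + ["tail"]
pred = ["head"] * 10              # a model that ignores the tail class
macro, micro = f1_scores(gold, pred, ["head", "tail"])
print(round(macro, 3), round(micro, 3))
```

The model that never predicts the tail class still scores a high Micro F1 (0.9) but only about 0.47 Macro F1, which is exactly the head/tail gap the long-tailed analysis tracks.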

Main Results
We leverage the basic PLM fine-tuning code from OpenNRE (https://github.com/thunlp/OpenNRE) and the state-of-the-art prompt-based RE method KnowPrompt (Chen et al., 2022d) to conduct extensive experiments across the 8 datasets with various methods and settings. The results of the main experiments are shown in Table 2, which illustrates the following findings:

Finding 1: Prompt-based tuning largely outperforms standard fine-tuning for RE and is especially effective in the low-resource scenario. The comparison between the results of standard fine-tuning and prompt-based tuning indicates that prompts can provide task-specific information and bridge the pre-train/fine-tune gap, thus empowering PLMs in low-resource RE.
Finding 2: Though balancing methods obtain advancement with long-tailed distribution, they may still fail on challenging RE datasets, such as ChemProt, DialogRE and DuIE2.0. Comparing the Macro F1 scores of the Balance and Normal columns, the blue (bad) results illustrate that balancing methods are affected by the complexity of long contexts with multiple sentences and relational triples.
Finding 3: Data augmentation brings large gains on RE and sometimes even better performance than prompt-based tuning, such as on SemEval, according to the difference between the two pairs of DA and Normal columns in the 10% setting. More data generated through DA methods are complementary to other baselines, boosting performance.
Finding 4: RE systems struggle to obtain correct relations from cross-sentence contexts and among multiple triples. The extremely low F1 scores of standard fine-tuning on the 8-shot ChemProt and DialogRE datasets demonstrate this finding. One text in ChemProt can be related to many relational triples (in the training set, 347 texts are related to 3 triples and 699 texts to 2 triples), while in DialogRE, the input text is extremely long (one text can contain 10 sentences). Even with the powerful prompt-based tuning method, it is non-trivial to address the low-resource issue, as shown by the unexpected drop in F1 scores on ChemProt and SciERC.
Finding 5: Self-training with unlabeled in-domain data may not always show an advantage for low-resource RE. There is much noise in the generated pseudo labels. Furthermore, assigning labels in RE requires considering both the semantics and the positions of entities in a text, which is exceedingly challenging. Therefore, models with self-training cannot always obtain better performance in low-resource settings.

Comprehensive Empirical Analysis
Different Prompting Methods

To investigate the effects of different prompts, we conduct an empirical analysis on SemEval, TACREV, SciERC and ChemProt, as shown in Figure 3. We observe the following insights: (i) Prompt-based tuning is more beneficial in general domains than in specific domains for low-resource RE. Prompt-based tuning achieves the largest gain, 44.85% Micro F1, when comparing fine-tuning and KnowPrompt on 8-shot SemEval, while suffering the worst drop, 25.65% Micro F1, when comparing fine-tuning and the template prompt on 8-shot SciERC, even with a domain-specific PLM. Beyond the intrinsic difficulty of these two datasets, general manual prompts carry little domain knowledge related to vertical domains, hindering performance. (ii) Entity type information in prompts is helpful for low-resource RE. The head and tail entity types in prompts provide strong constraints between relations and their related entities. Prompting methods with entity type information, i.e., KnowPrompt and PTR, perform better than the template and schema-based prompts on most datasets, which illustrates that prompts with entity-type information are more appropriate for low-resource RE. An abnormal phenomenon is that KnowPrompt and PTR obtain worse results than the template and schema-based prompts on TACREV. Annotation errors in the training set of TACREV (Stoica et al., 2021) can lead to overestimating models that depend on side information about entities, such as entity names, spans and types (Zhou and Chen, 2021). Since the templates of KnowPrompt and PTR are natural language sentences consisting of the head and tail entities and their relations, they require high-quality annotated entity mentions, positions, types and relational words, whereas this information matters little to the template and schema-based prompts.

Different Balancing Methods
We also conduct experiments to validate the effectiveness of different balancing methods on two long-tailed datasets. We categorize the classes into three splits based on the number of training instances per class, Few, Medium, and Many, and also report the results on the whole dataset in the Overall setting in Table 3 (split schemes are in Appendix B). We notice that with re-balancing methods (e.g., Focal Loss and LDAM Loss), the tail relations (Few) can yield better performance on both general and domain-specific datasets. However, some techniques, such as GHM-C, fail to contribute performance gains. Overall, our empirical analysis illustrates that RE performance can be improved with balancing methods, which indicates that long-tailed RE is a challenging classification task, and more attention should be paid to developing suitable methodologies.

Table 4: Micro F1 Scores (%) on four datasets with data generated by different data augmentation methods from 10% training sets. Three DA methods are conducted to substitute words at three positions: only in contexts, only in entities, and in both. "-" indicates that non-repeated data generated based on contextual word embeddings is not available.

Different Data Augmentation Methods

We generate augmented data from the 10% training sets by substituting tokens based on three methods. From Table 4, we notice that DA with WordNet obtains the best performance improvement in most cases. Further, we observe that DA methods can raise Micro F1 by up to 13.6% and 5.92% on SemEval and SciERC, respectively, compared to the original prompt-based tuning, demonstrating that DA contributes a lot in the low-resource scenario. Besides, we observe that the performance improvement is much smaller in specific domains, such as SciERC and ChemProt, than in the general domain. We think this is because vertical domains contain many specific terms, making it challenging to obtain qualified augmented instances, which yields a lower performance improvement.
Related Work

Prompting Methods for RE

Though fine-tuning PLMs has swept the NLP community, there is still a big gap between the pre-training and fine-tuning objectives, hindering few-shot performance. Hence, prompt-based tuning was proposed in GPT-3 (Brown et al., 2020) and has drawn much attention. A series of studies has illustrated the decent performance of prompt-based tuning (Shin et al., 2020; Lester et al., 2021; Li and Liang, 2021), especially in few-shot classification tasks (Schick and Schütze, 2021; Liu et al., 2021b). Typically, PTR encodes prior knowledge using logic rules in prompt-based tuning with several sub-prompts for text classification. KnowPrompt (Chen et al., 2022d) incorporates knowledge among relation labels into prompt tuning for RE with synergistic optimization for better performance.

Methods for Long-tailed Distribution Data
Many re-balancing methods have been proposed to tackle the long-tailed problem (Kang et al., 2020b; Nan et al., 2021). Data distribution re-balancing methods re-sample the dataset into a more balanced data distribution (Han et al., 2005; Mahajan et al., 2018). Various re-weighting losses (Cui et al., 2019; Li et al., 2019a, 2020b; Lin et al., 2020a; Cao et al., 2019) assign balanced weights to training samples from each class. For RE, Nan et al. (2021) introduce causal inference to mitigate spurious correlation issues for information extraction.
Data Augmentation for NLP An effective method for low-resource NLP is data augmentation. Token-level DA approaches include replacing tokens with their synonyms (Kolomiyets et al., 2011; Wang and Yang, 2015), deleting tokens (Iyyer et al., 2015), inserting random tokens (Wei and Zou, 2019; Miao et al., 2020), or replacing meaningless tokens with random tokens (Niu and Bansal, 2018).

Conclusion
We provide an empirical study on low-resource RE. Specifically, we analyze prompt-based tuning for few-shot RE, balancing methods for long-tailed RE datasets, and the use of data augmentation and unlabeled in-domain data. We systematically evaluate baselines on 8 benchmark datasets in low-resource settings (e.g., 8-shot, 10%) and provide insightful findings. We hope this study can help inspire future research on low-resource RE with more robust models and promote transitioning the technology to real-world industrial scenarios.

Limitations
With the fast development of low-resource RE, we cannot compare and evaluate all previous studies, owing to differing settings and unavailable open-source code. Our motivation is to develop a universal, GLUE-like, open platform for low-resource RE for the community. We will continue to maintain the benchmark by adding new datasets.
3. Train a student model Θ_S via cross-entropy loss on both the gold-labeled data D_L and the soft-labeled data D_SU. The loss function of Θ_S is

L(Θ_S) = L(D_L) + λ_U L(D_SU),

where λ_U is the weighting hyper-parameter, set to 0.2 in this work. It is possible to iterate from Step 1 to Step 3 multiple times by initializing Θ_T in Step 1 with the newly learned Θ_S from Step 3. We only perform self-training once in our experiments for simplicity, because the result is not good and is not sensitive to further iterations.