Continual Contrastive Finetuning Improves Low-Resource Relation Extraction

Relation extraction (RE), which has relied on structurally annotated corpora for model training, has been particularly challenging in low-resource scenarios and domains. Recent literature has tackled low-resource RE by self-supervised learning, where the solution involves pretraining the entity pair embedding with an RE-based objective and finetuning on labeled data with a classification-based objective. However, a critical challenge to this approach is the gap in objectives, which prevents the RE model from fully utilizing the knowledge in pretrained representations. In this paper, we aim at bridging the gap and propose to pretrain and finetune the RE model using consistent objectives of contrastive learning. Since in this kind of representation learning paradigm one relation may easily form multiple clusters in the representation space, we further propose a multi-center contrastive loss that allows one relation to form multiple clusters, better aligning finetuning with pretraining. Experiments on two document-level RE datasets, BioRED and Re-DocRED, demonstrate the effectiveness of our method. Particularly, when using 1% of the end-task training data, our method outperforms a PLM-based RE classifier by 10.5% and 6.1% on the two datasets, respectively.


Introduction
Relation extraction (RE) is a fundamental task in NLP. It aims to identify the relations among entities in a given text from a predefined set of relations. While much effort has been devoted to RE in supervised settings (Zhang et al., 2017, 2018; Nan et al., 2020), RE is extremely challenging in high-stakes domains such as biology and medicine, where annotated data are comparatively scarce due to high annotation costs. Therefore, there is a practical and urgent need for developing low-resource RE models that do not rely on large-scale end-task annotations.
To realize low-resource RE, previous work has focused on pretraining entity pair embeddings on large corpora using RE-based pretraining objectives. Particularly, Baldini Soares et al. (2019) propose a self-supervised matching-the-blanks (MTB) objective that encourages the embeddings of the same entity pair in different sentences to be similar. Later work (Peng et al., 2020; Qin et al., 2021) extends this idea with distant supervision (Mintz et al., 2009) and improves representation learning using contrastive learning (Hadsell et al., 2006; Oord et al., 2018; Chen et al., 2020). To adapt to training on RE annotations, these works finetune the pretrained entity pair embeddings on labeled data using classification-based objectives. Although this paradigm produces better results than RE models initialized with pretrained language models (PLMs), it creates a significant divergence between the pretraining and finetuning objectives, preventing the model from fully exploiting the knowledge gained in pretraining.
In this paper, we aim to bridge this gap between RE pretraining and finetuning. Our key idea is to use similar objectives in both stages. First, we propose to continually finetune the pretrained embeddings by contrastive learning, which encourages the entity pair embeddings of the same relation to be similar. However, as pretraining and finetuning are conducted on different tasks, entity pairs of the same relation can form multiple different clusters in the pretrained embedding space, where the standard supervised contrastive loss (Khosla et al., 2020) may distort the representation because of its underlying one-cluster assumption (Graf et al., 2021). Therefore, we further propose a multi-center contrastive loss (MCCL), which encourages an entity pair to be similar to only a subset of entity pairs of the same relation, allowing one relation to form multiple clusters. Second, we propose classwise k-nearest neighbors (kNN; Khandelwal et al. 2020, 2021) for inference, where predictions are made based on the most similar instances. We focus on document-level RE (Jia et al., 2019; Yao et al., 2019), which involves both intra- and cross-sentence relations. To the best of our knowledge, this work represents the first effort to explore self-supervised pretraining for document-level RE. Unlike prior studies (Peng et al., 2020; Qin et al., 2021), we do not use distant supervision. Instead, we pretrain entity pair embeddings with an improved MTB objective on unlabeled corpora, using contrastive learning to learn representations suited to downstream RE. We then finetune the pretrained model on labeled data with MCCL. Experiments on two datasets, BioRED (Luo et al., 2022) in the biomedical domain and Re-DocRED (Tan et al., 2022b) in the general domain, demonstrate that our pretraining and finetuning objectives significantly outperform baseline methods in low-resource settings.
Particularly, in the low-resource setting using 1% of the labeled data, our method outperforms PLM-based classifiers by 10.5% and 6.1% on BioRED and Re-DocRED, respectively. Based on our pretrained representations, MCCL outperforms classification-based finetuning by 6.0% and 4.1%, respectively. We also observe that as more data becomes available, the performance gap between MCCL and classification-based finetuning diminishes.
Our technical contributions are three-fold. First, we propose to pretrain PLMs with our improved MTB objective and show that it significantly improves PLM performance in low-resource document-level RE. Second, we present a technique that bridges the gap in learning objectives between RE pretraining and finetuning via continual contrastive finetuning and kNN-based inference, helping the RE model leverage pretraining knowledge. Third, we design a novel MCCL finetuning objective that allows one relation to form multiple different clusters, further reducing the distributional gap between pretraining and finetuning.

Related Work
Document-level RE. Existing document-level RE models can be classified into graph-based and sequence-based models. Graph-based models construct document graphs spanning sentence boundaries and use graph encoders such as the graph convolutional network (GCN; Kipf and Welling 2017) to aggregate information. Particularly, one line of work builds document graphs using words as nodes, with inner- and inter-sentence dependencies (e.g., syntactic dependencies, coreference, etc.) as edges. Later work extends this idea by applying different network structures (Peng et al., 2017; Jia et al., 2019) or introducing other node types and edges (Christopoulou et al., 2019; Nan et al., 2020; Zeng et al., 2020). On the other hand, sequence-based methods (Zhang et al., 2021; Tan et al., 2022a) use PLMs to learn cross-sentence dependencies without graph structures. Particularly, prior work has proposed enriching relation mention representations with localized context pooling. Zhang et al. (2021) propose to model the inter-dependencies between relation mentions by semantic segmentation (Ronneberger et al., 2015). In this work, we study a general method of self-supervised RE. Our method is therefore independent of the model architecture and can be adapted to different RE models.
Low-resource RE. Labeled RE data may be scarce in real-world applications, especially in low-resource and high-stakes domains such as finance and biomedicine. Much effort has been devoted to training RE models in low-resource settings. Some work tackles low-resource RE by indirect supervision, which solves RE via other tasks such as machine reading comprehension (Levy et al., 2017), textual entailment (Sainz et al., 2021), and abstractive summarization. However, indirect supervision may not be practical in high-stakes domains, where annotated data for other tasks are also scarce. Other efforts (Baldini Soares et al., 2019; Peng et al., 2020; Qin et al., 2021) improve low-resource RE by pretraining on large corpora with RE-based objectives. Specifically, Baldini Soares et al. (2019) propose an MTB objective that encourages the embeddings of the same entity pair in different sentences to be similar. Peng et al. (2020) propose to pretrain on distantly labeled corpora, making the embeddings of entity pairs with the same distant label similar, and introduce a contrastive-learning-based training objective to improve representation learning. Qin et al. (2021) further introduce an entity discrimination task and pretrain the RE model on distantly labeled document corpora. In this paper, we study self-supervised pretraining for document-level RE and how to reduce the gap between pretraining and finetuning, which is critical for bridging the training signals obtained in these two stages but has been overlooked in prior work.

Method
In this work, we study a self-supervised approach for document-level RE. Given a document d and a set of entities {e_i}_{i=1}^{N}, where each entity e_i has one or multiple mentions in the document, document-level RE aims at predicting the relations of all entity pairs (e_s, e_o), with s, o ∈ {1, ..., N}, from a predefined set of relations R (including an NA class indicating that no relation exists), where e_s and e_o are the subject and object entities, respectively. In the self-supervised RE setting, we have a large unlabeled document corpus for pretraining and a labeled RE dataset for finetuning. The document corpus is annotated with entity mentions and their entity types but no relations. Our goal is to train a document-level RE classifier, especially in the low-resource setting.
Our training pipeline consists of two phases: pretraining and finetuning. In pretraining, we use the (unlabeled) document corpus to pretrain the entity pair embeddings with our improved matching-the-blanks objective (MTB; Baldini Soares et al. 2019), where the LM learns to decide whether two embeddings correspond to the same entity pair, and representation learning is enhanced with contrastive learning. In finetuning, we continue to train the pretrained model on relation-labeled data using a multi-center contrastive loss (MCCL), which achieves better performance than the traditional classifier paradigm because its learning objective is better aligned with pretraining. After training, we use classwise k-nearest neighbor (kNN) inference, which is well suited to the contrastively finetuned model.
The rest of this section is organized as follows: we introduce the model architecture used in both pretraining and finetuning in Section 3.1, the pretraining process in Section 3.2, finetuning in Section 3.3, and inference in Section 3.4.

Model Architecture
Encoder. Given a document d = [x_1, x_2, ..., x_l], we first mark the spans of the entity mentions by adding special entity markers [E] and [/E] to the start and the end of each mention. Then we encode the document with a PLM to get the contextual embeddings of the tokens:

H = [h_1, h_2, ..., h_l] = PLM([x_1, x_2, ..., x_l]).

We take the contextual embedding of [E] at the last layer of the PLM as the embedding of the corresponding entity mention, and accumulate the embeddings of the mentions of the same entity with LogSumExp pooling (Jia et al., 2019) to get the entity embedding h_{e_i}.

Entity pair embedding. Given an entity pair t = (e_s, e_o) in document d, where e_s and e_o are the subject and object entities, respectively, we calculate the entity pair embedding by:

z_{(e_s, e_o)} = W_linear^T [h_{e_s}; h_{e_o}; c_{(e_s, e_o)}].

Here h_{e_s}, h_{e_o} ∈ R^d are the embeddings of the subject and object entities, c_{(e_s, e_o)} ∈ R^d is the localized context encoding for (e_s, e_o), and W_linear ∈ R^{3d×d} is a linear projector. Localized context pooling, introduced in prior work, derives a context embedding conditioned on an entity pair by finding the context that both the subject and object entities attend to. Specifically, denote the multi-head attention in the last layer of the PLM as A ∈ R^{m×l×l}, where m is the number of attention heads and l is the input length. We first take the attention scores at [E] as the attention of each entity mention, then accumulate the attention over the mentions of an entity by mean pooling to get the entity-level attention A^{(e_i)} ∈ R^{m×l}. Finally, we compute c_{(e_s, e_o)} by:

q = Σ_{j=1}^{m} (A_j^{(e_s)} ∘ A_j^{(e_o)}),   a = q / (1^T q),   c_{(e_s, e_o)} = H^T a,

where ∘ denotes elementwise multiplication. In the rest of the section, we introduce how to pretrain and finetune the RE model based on the entity pair embedding z_{(e_s, e_o)}.
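To make the computation concrete, here is a minimal NumPy sketch of localized context pooling and the entity pair projection. The pooling over heads and mentions is folded into the attention inputs, and the tanh nonlinearity on the projection is our assumption rather than something stated in the text:

```python
import numpy as np

def localized_context(H, A_s, A_o):
    """Localized context pooling (sketch). H: (l, d) token embeddings;
    A_s, A_o: (l,) entity-level attention distributions for the subject
    and object entities, already pooled over heads and mentions. Tokens
    attended to by both entities dominate the context vector."""
    q = A_s * A_o                  # elementwise product of attentions
    a = q / (q.sum() + 1e-8)       # renormalize to a distribution
    return H.T @ a                 # (d,) context vector c_(es,eo)

def entity_pair_embedding(h_s, h_o, c, W):
    """Project the concatenation [h_s; h_o; c] with the (3d, d) linear
    map W_linear. The tanh nonlinearity is an assumption."""
    return np.tanh(np.concatenate([h_s, h_o, c]) @ W)
```

The sketch mirrors the equations above: the context vector is a convex combination of token embeddings weighted by the joint attention of the two entities.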

Pretraining
We pretrain the LM on the document corpus using the MTB objective. MTB is based on a simple assumption: the same entity pair is more likely than different entity pairs to be connected by the same relation across contexts. The MTB objective transforms the similarity learning problem into a pairwise binary classification problem: given two relation-describing utterances in which entity mentions are masked, the model classifies whether the entity pairs are the same or not. This pretraining objective has shown effectiveness on several sentence-level RE datasets (Zhang et al., 2017; Hendrickx et al., 2010; Han et al., 2018).
However, when it comes to document-level RE, Qin et al. (2021) have observed no improvement led by the vanilla MTB pretraining. Therefore, we replace the pairwise binary classification with contrastive learning, which is adopted in later RE pretraining works (Peng et al., 2020;Qin et al., 2021) and can effectively learn from more positive and negative examples. Details of training objectives are elaborated in the rest of the section. We introduce the details of data preprocessing of the pretraining corpus in Appendix A.
Training objective. The overall goal of pretraining is to make the embeddings of the same entity pair from different documents more similar than those of different entity pairs. For clarity, we call two occurrences of the same entity pair from different documents a positive pair, and two different entity pairs a negative pair. We use the InfoNCE loss (Oord et al., 2018) to model this objective. Given the documents in a batch, let P be the set of all positive pairs and N_t the set of entity pairs different from t. The contrastive MTB loss is:

L_rel = - (1/|P|) Σ_{(t_i, t_j) ∈ P} log [ exp(sim(z_{t_i}, z_{t_j})/τ) / ( exp(sim(z_{t_i}, z_{t_j})/τ) + Σ_{t_k ∈ N_{t_i}} exp(sim(z_{t_i}, z_{t_k})/τ) ) ],   (1)

where sim(z_{t_i}, z_{t_j}) denotes the similarity between the embeddings of t_i and t_j, and τ is a temperature hyperparameter. Following Chen et al. (2020), we use cosine similarity as the similarity metric. Similar to SimCSE, we further add a self-supervised contrastive loss that requires the embeddings of the same entity pair augmented by different dropout masks to be similar, encouraging the model to learn more instance-discriminative features that lead to less collapsed representations. Specifically, denote the two entity pair embeddings of t derived with different dropout masks as z_t and ẑ_t, respectively, the set of all entity pairs in the batch as T, and the set of entity pairs involved in positive pairs as T_P. The self-supervised loss is:

L_self = - (1/|T \ T_P|) Σ_{t ∈ T \ T_P} log [ exp(sim(z_t, ẑ_t)/τ) / Σ_{t' ∈ T} exp(sim(z_t, ẑ_{t'})/τ) ].

Finally, we use a masked language modeling loss L_mlm to adapt the LM to the document corpus. The overall pretraining objective is:

L_pretrain = L_rel + L_self + L_mlm.

For faster convergence, we initialize our model with a PLM that is pretrained on a larger corpus and continually pretrain it on the document corpus with our new pretraining objectives. We use BERT (Devlin et al., 2019) for the general domain and PubmedBERT (Gu et al., 2021) for the biomedical domain.

Finetuning
After pretraining, we finetune the LM on labeled document-level RE datasets. In previous studies (Baldini Soares et al., 2019; Peng et al., 2020; Qin et al., 2021), pretraining and finetuning are conducted with different learning objectives. Specifically, after using the pretrained weights to initialize an RE classifier, the model is finetuned with a classification-based training objective. Under our model architecture, a straightforward finetuning method is to add a softmax classifier on top of the entity pair embedding, for which the cross-entropy loss over a batch of entity pairs T is formulated as:

L_CE = - (1/|T|) Σ_{t ∈ T} log [ exp(w_{y_t}^T z_t + b_{y_t}) / Σ_{r ∈ R} exp(w_r^T z_t + b_r) ],

where y_t is the ground-truth label of entity pair t, and w_r, b_r are the weight and bias of the classifier for relation r. Though this approach has shown improvements, it may produce sub-optimal outcomes from MTB pretraining, since it implicitly assumes that entity pairs of the same relation lie in a single cluster, while MTB pretraining may learn multiple clusters per relation. For example, the entity pairs (Honda Corp., Japan) and (Mount Fuji, Japan), although both likely to be expressed with the same relation country in documents, are likely to fall in different clusters, since MTB views them as negative pairs due to their different subject entities. Therefore, we propose an MCCL objective to bridge this gap. Next, we discuss the distributional assumption of the softmax classifier and the supervised contrastive loss, then present our MCCL objective.

Distributional assumption. We conduct a probing analysis on the distribution of pretrained representations to justify the multi-cluster assumption. Specifically, we fix the weights of the pretrained MTB model and fit different classifiers on top of it, including a softmax classifier, a nearest-centroid classifier (both assuming one cluster per relation), and a classwise kNN classifier (assuming multiple clusters per relation). We evaluate these classifiers on the test set.
Results are shown in Table 1. We find that classwise kNN greatly outperforms the others, showing that MTB pretraining learns multiple clusters per relation. Therefore, to accommodate this multi-cluster structure, we need to finetune the representations with a training objective that permits multiple clusters per relation. Besides the softmax classifier with cross-entropy loss, we also consider the supervised contrastive loss (SupCon; Khosla et al. 2020; Gunel et al. 2021). SupCon has a similar form to InfoNCE in Eq. (1), except that it uses instances of the same/different relations as positive/negative pairs. However, previous work (Graf et al., 2021) has shown that both softmax and SupCon are minimized when the representations of each class collapse to a vertex of a regular simplex. In our case, this means the entity pair embeddings of the same relation learned in pretraining would collapse to a single point, creating a distributional gap between pretraining and finetuning.

Training objective. We thereby propose the MCCL objective. Given a batch of entity pairs T and the sets of entity pairs grouped by relation {T_r}_{r ∈ R}, our loss is formulated as:

s_{t_i, r} = Σ_{t_j ∈ T_r \ {t_i}} w_{ij} · sim(z_{t_i}, z_{t_j}),   w_{ij} = exp(sim(z_{t_i}, z_{t_j})/τ_1) / Σ_{t_k ∈ T_r \ {t_i}} exp(sim(z_{t_i}, z_{t_k})/τ_1),

L_MCCL = - (1/|T|) Σ_{t_i ∈ T} log [ exp((s_{t_i, y_{t_i}} + b_{y_{t_i}})/τ_2) / Σ_{r ∈ R} exp((s_{t_i, r} + b_r)/τ_2) ],

where τ_1 and τ_2 are temperature hyperparameters and b_r ∈ R is a classwise bias. The loss calculation can be split into two steps. First, we calculate the similarity s_{t_i, r} between t_i and relation r as a weighted average of the similarities between t_i and t_j ∈ T_r, where more similar t_j receive larger weights. Next, we use the cross-entropy loss to make the similarity to the ground-truth relation larger than to the others. In this way, MCCL only optimizes t_i to be similar to a few closest entity pairs of the ground-truth relation, and thus encourages multiple clusters per relation in the entity pair embedding space. Note that MCCL can be easily extended to multi-label classification scenarios; details are given in Appendix B.
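The two-step MCCL computation described above can be sketched in NumPy as follows. The classwise bias and the proxies are omitted for brevity, and every relation is assumed to appear at least twice in the batch (which the proxies guarantee in practice):

```python
import numpy as np

def mccl_loss(z, labels, tau1=0.2, tau2=0.2):
    """Multi-center contrastive loss (sketch). z: (n, d) entity pair
    embeddings; labels: relation id per instance. Assumes every relation
    in `labels` has at least two instances in the batch; the classwise
    bias b_r is omitted."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = zn @ zn.T                        # pairwise cosine similarities
    rels = sorted(set(labels))
    n, loss = len(z), 0.0
    for i in range(n):
        scores = []
        for r in rels:
            idx = [j for j in range(n) if labels[j] == r and j != i]
            s = sim[i, idx]
            w = np.exp(s / tau1)
            w /= w.sum()                   # closer instances get larger weight
            scores.append((w * s).sum())   # similarity of t_i to relation r
        logits = np.array(scores) / tau2
        logits -= logits.max()             # numerical stability
        logp = logits - np.log(np.exp(logits).sum())
        loss -= logp[rels.index(labels[i])]
    return loss / n
```

Because the weights concentrate on the nearest same-relation instances, each instance is pulled toward its local cluster rather than toward a single class centroid.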
Proxies. We use batched training for finetuning, where the entity pairs in the current batch are used to calculate MCCL. However, some relations in R, especially long-tail relations, may be rare or missing in the current batch. When T_r \ {t_i} is empty, s_{t_i, r} and MCCL become undefined. To tackle this problem, we use proxies (Movshovitz-Attias et al., 2017; Zhu et al., 2022). We add one proxy vector p_r for each relation r, a trainable parameter associated with an embedding z_{p_r}. We incorporate the proxies into MCCL by replacing T_r with T'_r = T_r ∪ {p_r}, ensuring that T'_r \ {t_i} is never empty in training and preventing MCCL from becoming undefined. The proxies are randomly initialized and updated during training by backpropagation.

Inference
We use classwise kNN (Christobel and Sivaprakasam, 2013) for inference, which predicts relations based on similarly represented instances and thus aligns with our contrastive finetuning objective. Given a new entity pair, we first find the k most similar instances in the training data for each relation (including NA), then calculate the average cosine similarity s_r^avg for each relation. Finally, the model returns the relation with the maximum s_r^avg + b_r for single-label prediction, and all relations with higher s_r^avg + b_r than NA for multi-label prediction. We use classwise kNN because it suits RE datasets, whose label distributions are usually long-tailed (Zhang et al., 2019).
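A minimal sketch of classwise kNN inference for the single-label case; the classwise bias b_r is assumed to default to zero when not supplied:

```python
import numpy as np

def classwise_knn(query, train_z, train_y, k=3, bias=None):
    """For each relation, take the k most similar training instances and
    average their cosine similarity to the query; predict the relation
    with the highest (biased) average (single-label case)."""
    train_y = np.asarray(train_y)
    q = query / np.linalg.norm(query)
    tz = train_z / np.linalg.norm(train_z, axis=1, keepdims=True)
    sims = tz @ q                          # cosine similarity to all training points
    scores = {}
    for r in set(train_y.tolist()):
        top = np.sort(sims[train_y == r])[::-1][:k]   # k nearest of relation r
        scores[r] = top.mean() + (bias or {}).get(r, 0.0)
    return max(scores, key=scores.get)
```

Because neighbors are selected per relation rather than globally, rare relations still contribute k candidates, which is why this variant suits long-tailed label distributions.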

Experiments
We evaluate our proposed method with a focus on low-resource RE (Sections 4.1-4.3), and present detailed analyses (Section 4.4) and visualization (Section 4.5) to justify method design choices.

Datasets
We conduct experiments with two document-level RE datasets. The BioRED dataset (Luo et al., 2022) is a manually labeled single-label RE dataset in the biomedical domain. Its entity pairs are classified into 9 types (including an NA type indicating no relation). It has a training set of 400 documents, which we use in finetuning. For pretraining, we use the PubTator Central corpus, which annotates the PubMed corpus with entity mentions and their named entity types. The Re-DocRED dataset (Tan et al., 2022b) is a large-scale multi-label dataset of the general domain. It is a relabeled version of the DocRED dataset (Yao et al., 2019), addressing DocRED's incomplete annotation issue, where a large percentage of entity pairs are mislabeled as NA. The entity pairs in Re-DocRED are classified into 97 types (incl. NA). It has a training set of 3,053 documents, which we use in finetuning. For pretraining, we use the distantly labeled training set provided by DocRED, which consists of 101,873 documents; we remove the relation labels and use our improved MTB objective to pretrain the model.

Experimental Setup
Model configurations. We implement our models using Hugging Face Transformers (Wolf et al., 2020). We use AdamW (Loshchilov and Hutter, 2018) for optimization with a weight decay of 0.01. During pretraining, we use a batch size of 16, a learning rate of 5e-6, a temperature of 0.05, and 3 and 10 epochs for BioRED and DocRED, respectively. During finetuning, we use a batch size of 32, a learning rate of 5e-5, and 100 and 30 epochs for BioRED and DocRED, respectively. The temperatures in MCCL are set to τ_1 = τ_2 = 0.2 for BioRED and τ_1 = 0.01, τ_2 = 0.03 for DocRED. We search k over {1, 3, 5, 10, 20} for classwise kNN using the development set. We run experiments on Nvidia V100 GPUs.
Evaluation settings. In addition to standard full-shot training, we consider low-resource settings. To create each setting, we randomly sample a fixed proportion p% of the entity pairs from the training set as our training data and use the original test set for evaluation. We use the same evaluation metrics as the original papers: micro-F1 for BioRED, and micro-F1 and micro-F1-Ign for Re-DocRED, where micro-F1-Ign excludes relational facts in the test set that have appeared in training.
Compared methods. We experiment with the following finetuning objectives: (1) lazy learning, which directly uses the pretrained embedding and training data to perform kNN without finetuning; (2) cross-entropy loss (CE), which adds a softmax classifier on top of the PLM and finetunes the model with cross-entropy loss; (3) supervised contrastive loss (SupCon); and (4) multi-center contrastive loss (MCCL). In inference, classwise kNN is used for all methods except CE. As SupCon does not apply to multi-label scenarios, we only evaluate it on BioRED. For each objective, we also evaluate the PLM before and after MTB pretraining. We use different PLMs as the backbone of the model: PubmedBERT-base for BioRED and BERT-base for Re-DocRED, pretrained on the biomedical and general domains, respectively.

Main Results
The results on the test sets of Re-DocRED and BioRED are shown in Table 2 and Table 3, respectively. All results are averaged over five runs of training with different random seeds. Overall, the combination of MTB and MCCL achieves the best performance in low-resource settings where 1%, 5%, and 10% of relation-labeled data are used. Further, on the same MTB-based representations, MCCL outperforms CE in low-resource settings, showing that MCCL can better leverage the pretraining knowledge through a well-aligned finetuning objective. However, this improvement diminishes when abundant labeled data are available: MCCL underperforms CE on both datasets with full training data. In addition, we observe that MTB pretraining consistently improves MCCL and CE on both datasets. These results demonstrate the effectiveness of MTB pretraining for more precise document-level RE with less end-task supervision.
Considering other training objectives, we observe that lazy learning produces meaningful results. On both datasets, the results of lazy learning based on MTB with 10% of data are comparable to finetuning with 1% of data. This shows that the entity pair embedding pretrained on unlabeled corpora contains knowledge that can be transferred to unseen relations. We also observe that SupCon using kNN-based inference underperforms both CE and MCCL on BioRED, showing that its one-cluster assumption hurts the knowledge transfer.

Ablation Study
Pretraining objectives. We analyze the effectiveness of the pretraining losses proposed in Section 3.2. To do so, we pretrain the model with one loss removed at a time while keeping the finetuning setup on BioRED fixed to MCCL. The results are shown in Table 4. Overall, we observe that all losses are effective. If we remove all proposed techniques and use the vanilla MTB pretraining objective of binary pairwise classification, the results are only slightly better than, or even worse than, using no pretraining. Among the techniques, removing L_rel leads to the largest performance drop, showing that MTB-based pretraining is critical to improving low-resource RE. Removing L_self also leads to a large performance drop, as L_self encourages the model to learn more discriminative features that lead to less collapsed representations. This finding aligns with recent studies in computer vision (Islam et al., 2021) showing that reducing collapsed representations with self-supervised contrastive learning improves transferability to downstream tasks.
Performance w.r.t. different temperatures. We discuss the impact of the two temperatures in MCCL. τ_1 controls the weighting of instances: with a very small τ_1, each instance only forms a cluster with its nearest neighbor in the batch, while with a very large τ_1, instances of the same relation collapse into a single cluster. τ_2 controls the importance of hard instances, as in other contrastive losses (e.g., τ in Eq. (1)). Wang and Liu (2021) observe that a small τ_2 makes the model focus more on hard instances, while Khosla et al. (2020) observe that a too-small τ_2 leads to numerical instability. Figure 1 shows the results of varying one temperature while keeping the other fixed. For τ_1, we find that a large temperature harms performance, supporting our multi-cluster assumption for low-resource RE. For τ_2, both small and large values impair performance, in line with the prior observations.
Performance w.r.t. different amounts of data.
The main results show that MCCL outperforms CE in the low-resource setting while slightly underperforming CE with full training data. We further evaluate MCCL and CE with varying amounts of end-task data, experimenting on BioRED with the entity pair embedding pretrained by MTB. Results are shown in Figure 2. We observe that MCCL consistently outperforms CE by a large margin when less than 20% of the training data is used, and performs similarly or worse thereafter. This again demonstrates the effectiveness of MCCL in low-resource RE. However, as pretraining and finetuning are based on different tasks, fully adapting the model to the downstream data with CE yields similar or better performance in data-sufficient scenarios.

The visualization shows that both CE and SupCon learn one cluster per relation, while lazy learning and MCCL, as expected, generate multiple small clusters per relation. This observation indicates that MCCL better aligns with the pretraining objective, further explaining its better performance in low-resource settings.

Conclusion
In this paper, we study self-supervised learning for document-level RE. Our method uses an improved MTB pretraining objective that acquires cheap supervision signals from large corpora without relation labels. To bridge the gap between pretraining and end-task finetuning, we propose a continual contrastive finetuning objective, in contrast to prior studies that typically use classification-based finetuning, and use kNN-based inference. As the pretrained representation may form multiple clusters per relation, we further propose a multi-center contrastive loss that aligns with the nature of the pretrained representation. Extensive experiments on two document-level RE datasets demonstrate the effectiveness of these key techniques. Future work includes adapting our method to other information extraction tasks, such as n-ary relation extraction, named entity recognition, typing, and linking.

Limitations
The main limitation of MCCL is the requirement of a sufficiently large batch size in training (32 documents in our experiments), leading to a need for large GPU memory. This is because MCCL uses in-batch entity pairs for contrastive learning, and a small batch size does not provide enough instances to form multiple clusters. In addition, we need to store the entity pair embedding of the whole training set for kNN-based inference, which is less memory-efficient than CE.

B Adaptation to Multi-label RE
It is noteworthy that in some RE tasks, such as DocRED, one entity pair may have multiple relation labels, in which case the cross-entropy loss does not apply. Therefore, for multi-label scenarios, we substitute the cross-entropy loss (and likewise the softmax in MCCL) with an adaptive thresholding loss proposed in prior work. Specifically, denote the logits as l (the input to the softmax in cross-entropy loss), the set of positive relations as P (excluding NA), and the set of remaining relations except NA as N. The adaptive thresholding loss is formulated as:

L_ATL = - Σ_{r ∈ P} log [ exp(l_r) / Σ_{r' ∈ P ∪ {NA}} exp(l_{r'}) ] - log [ exp(l_NA) / Σ_{r' ∈ N ∪ {NA}} exp(l_{r'}) ].

This loss encourages the logits of positive relations to be higher than that of NA, and the logits of the remaining relations to be lower than that of NA. In prediction, the model returns the relations with logits higher than NA's.
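A minimal NumPy sketch of this loss, assuming NA serves as the per-instance threshold class and using the normalization sets defined above:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log-sum-exp."""
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def adaptive_thresholding_loss(logits, positives, na=0):
    """Adaptive thresholding loss (sketch). logits: per-relation scores
    for one instance; positives: indices of gold relations (excluding NA);
    na: index of the NA/threshold class."""
    logits = np.asarray(logits, dtype=float)
    negatives = [r for r in range(len(logits)) if r != na and r not in positives]
    # each positive relation competes with the other positives and NA
    pos_set = np.array(list(positives) + [na])
    l_pos = -sum(logits[p] - logsumexp(logits[pos_set]) for p in positives)
    # NA competes with all negative relations
    neg_set = np.array(negatives + [na])
    l_neg = -(logits[na] - logsumexp(logits[neg_set]))
    return l_pos + l_neg
```

At prediction time, the same threshold logic applies: relations whose logits exceed the NA logit are returned as the multi-label output.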

C More Experiments
Performance w.r.t. the number of proxies. We evaluate MCCL with different numbers of proxies. When no proxy is used, we ignore relations that do not appear in the current batch. The F1 on BioRED and Re-DocRED in the 1% low-resource setting is shown in Figure 4: adding proxies improves F1 significantly on both datasets. Using one proxy per relation yields an increase of 6.0% in F1 on BioRED, and a larger increase of 10.2% on Re-DocRED. This difference is due to Re-DocRED being more long-tailed, with 97% of its instances labeled NA compared to 80% in BioRED. We also observe that adding more proxies achieves similar or even worse results. This makes sense, as the proxies mainly stand in for long-tail relations that do not appear in the batch, and these relations contain too few instances to form multiple clusters.
Coarse-to-fine evaluation. To further illustrate that MCCL learns multiple clusters, we experiment on 1% of BioRED in a coarse-to-fine setting. Specifically, we merge all relations except NA into one relation in finetuning and apply kNN inference with the original labels. MCCL achieves an F1 of 30.3%, which is even better than CE trained with all relations provided. However, if we remove the instance weights in MCCL to degrade it to the one-cluster setting, the F1 consistently degrades during finetuning. This shows that the multi-cluster assumption helps preserve the fine-grained relation information in the pretrained representation.

Other ablation studies. We analyze the effectiveness of entity type filtering (Section A). Results are shown in Table 5. Removing entity type filtering degrades performance significantly, showing that entity type filtering removes many false negatives in pretraining and greatly improves the pretrained model. Besides, as the main results have demonstrated the effectiveness of MCCL in finetuning, we also examine whether MCCL improves pretraining. To do so, we replace the InfoNCE loss in Eq. (1) with MCCL, regarding different entity pairs as different classes. The results are comparable to or slightly worse than using L_rel, showing that the multi-cluster assumption of MCCL does not necessarily help pretraining.