Noise-Robust Semi-Supervised Learning for Distantly Supervised Relation Extraction



Introduction
Relation extraction (RE) is a fundamental process for constructing knowledge graphs, as it aims to predict the relationship between entities based on their context. However, most supervised RE techniques require extensive labeled training data, which can be difficult to obtain manually. To address this issue, Distant Supervision (DS) was proposed (Mintz et al., 2009) to automatically generate a labeled text corpus by aligning plain texts with knowledge bases (KB). For instance, if a sentence contains both the subject (s) and object (o) of a triple ⟨s, r, o⟩ (⟨subject, relation, object⟩), then the DS method considers ⟨s, r, o⟩ a valid label for that sentence. Conversely, if no relational triples apply, the sentence is labeled as "NA".
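The DS labeling rule above can be sketched in a few lines. This is an illustrative assumption, not the exact pipeline: real DS systems use entity linking rather than the substring matching used here, and the triple and relation name are made up.

```python
# Hypothetical sketch of the distant-supervision labeling rule: a sentence
# is labeled with relation r if it mentions both the subject and object of
# some KB triple <s, r, o>, and "NA" otherwise.
def distant_label(sentence, kb_triples):
    for subj, rel, obj in kb_triples:
        if subj in sentence and obj in sentence:
            return rel
    return "NA"

kb = [("Obama", "born_in", "United States")]
assert distant_label("Obama was born in the United States.", kb) == "born_in"
# Source of label noise: the rule also labels sentences that merely
# mention both entities without expressing the relation.
assert distant_label("Obama is a household name in the United States.", kb) == "born_in"
```

The second assertion illustrates exactly why DS datasets are noisy: the sentence mentions both entities but does not express the "born in" relation.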
Distantly supervised datasets usually contain high label noise in the training data due to this annotation process. To mitigate the impact of noisy labels caused by distant supervision, contemporary techniques (Lin et al., 2016; Alt et al., 2019; Chen et al., 2021b; Li et al., 2022; Dai et al., 2022) usually employ the multi-instance learning (MIL) framework, or variants of it, to train the relation extraction model.
Although MIL-based techniques can identify bag relation labels, they are not proficient at precisely mapping each sentence in a bag to an explicit sentence label (Feng et al., 2018; Jia et al., 2019; Ma et al., 2021). Several studies have focused on improving sentence-level DSRE and have empirically demonstrated the inadequacy of bag-level methods under sentence-level evaluation. (Feng et al., 2018; Qin et al., 2018) apply reinforcement learning to train a sample selector. (Jia et al., 2019) identify confident samples via frequent patterns. (Ma et al., 2021) utilize negative training to separate noisy data from the training data.
However, these methods have two main issues: (1) These works simply discard all noisy samples and train a relation extraction model on the selected samples. Filtering out noisy samples directly results in the loss of useful information: (Gao et al., 2021a) note that the DSRE dataset has a noise rate exceeding 50%, so discarding all of these samples would lose a significant amount of information.
(2) The confident sample selection procedure is not impeccable, and a small amount of noise may still remain among the chosen confident samples. Directly training a classifier in the presence of label noise is known to result in noise memorization.
To address these two issues, this work proposes a novel semi-supervised learning framework for sentence-level DSRE. First, we construct a K-NN graph over all samples using their hidden features. We then identify confident samples from the graph structure and treat the remaining samples as noisy. For issue (1), our method discards only the noisy labels and treats the corresponding samples as unlabeled data. We then exploit this unlabeled data through pseudo labeling within our robust semi-supervised learning framework to learn a better feature representation for relations. For issue (2), since noise may remain in the labeled dataset despite the initial selection, we develop a noise-robust semi-supervised learning framework that leverages mixup supervised contrastive learning to learn from the labeled dataset and curriculum pseudo labeling to learn from the unlabeled dataset.
To summarize, the contributions of this work are: • We propose a noise-robust semi-supervised learning framework, SSLRE, for the DSRE task, which effectively mitigates the impact of noisy labels.
• We propose to use graph structure information (weighted K-NN) to identify confident samples and to effectively convert noisy samples into useful training data via pseudo labeling.
• The proposed framework achieves significant improvement over previous methods in both sentence-level and bag-level relation extraction performance.
Related Work

Distantly Supervised Relation Extraction
Relation extraction (RE) is a fundamental process for constructing knowledge graphs (Zhang et al., 2023a; Xia et al., 2022; Zhang et al., 2023b). To generate large-scale auto-labeled data without human effort, (Mintz et al., 2009) first used Distant Supervision to label sentences mentioning two entities with their relations in KGs, which inevitably introduces wrongly labeled instances. To tackle this noise, most existing studies on DSRE build on the multi-instance learning framework, which handles noisy sentences in each bag and trains models on trustworthy bag-level representations. These methods typically employ attention mechanisms to assign lower weights to probable noisy sentences in the bag (Lin et al., 2016; Han et al., 2018b; Alt et al., 2019; Shen et al., 2019; Ye and Ling, 2019; Chen et al., 2021a; Li et al., 2022), or apply adversarial training or reinforcement learning to remove noisy sentences from the bag (Zeng et al., 2015; Qin et al., 2018; Shang and Wei, 2019; Chen et al., 2020; Hao et al., 2021). However, several studies (Feng et al., 2018; Jia et al., 2019; Ma et al., 2021; Gao et al., 2021a) indicate that bag-level DSRE methods are ineffective for sentence-level relation extraction. This work focuses on extracting relations at the sentence level. (Feng et al., 2018) applied reinforcement learning to identify confident instances based on a reward computed over noisy labels. (Jia et al., 2019) build an initial set of reliable sentences based on several manually defined frequent relation patterns. (Ma et al., 2021) assign complementary labels that cooperate with negative training to filter out noisy instances. Unlike previous studies, our method discards only the noisy labels and keeps the now-unlabeled samples, using pseudo labeling to effectively exploit them, which helps learn better representations.

Semi-Supervised-Learning
In SSL, a portion of the training dataset is labeled and the remaining portion is unlabeled. SSL has seen great progress in recent years. Since (Bachman et al., 2014) proposed a consistency-regularization-based method, many approaches have brought this idea into the semi-supervised learning field. MixMatch (Berthelot et al., 2019) proposes to combine consistency regularization with entropy minimization.
Mean Teacher (Tarvainen and Valpola, 2017) and Dual Student (Ke et al., 2019) are also based on consistency learning, aiming for the same outputs from different networks. Recently, FixMatch (Sohn et al., 2020) provided a simple yet effective weak-to-strong consistency regularization framework. FlexMatch (Zhang et al., 2021) introduced curriculum pseudo labeling to combat the imbalance of pseudo labels.
Our SSLRE framework differs from these frameworks in two main ways. Firstly, our labeled dataset still contains a small amount of noise, since confident sample identification cannot be perfect; we therefore utilize mixup supervised contrastive learning to combat this noise. Secondly, current SSL methods generate and utilize pseudo labels with the same head, which causes error accumulation during training. To address this, we propose a pseudo classifier head, which decouples the generation and utilization of pseudo labels into two parameter-independent heads to avoid error accumulation.

Learning with Noisy Data
In both computer vision and natural language processing, learning with noisy data is a widely discussed problem. Existing approaches include, but are not limited to, estimating the noise transition matrix (Chen and Gupta, 2015; Goldberger and Ben-Reuven, 2016), leveraging a robust loss function (Lyu and Tsang, 2019; Ghosh et al., 2017; Liu and Guo, 2019), introducing regularization (Liu et al., 2020; Iscen et al., 2022), selecting noisy samples by multi-network or multi-round learning (Han et al., 2018a; Wu et al., 2020), re-weighting examples (Liu and Tao, 2014), and generating pseudo labels (Li et al., 2020; Han et al., 2019). In addition, some state-of-the-art methods combine several techniques, e.g., DivideMix (Li et al., 2020) and ELR+ (Liu et al., 2020).
In this paper, we address the issue of noisy labels in distant relation extraction. Our approach first constructs a K-NN graph to identify confident samples based on their graph structure, and then uses noise-robust mixup supervised contrastive learning to train on the labeled samples.

Methodology
To achieve sentence-level relation extraction in DSRE, we propose a framework called SSLRE, which consists of two main steps. First, we select confident samples from the distantly supervised dataset using a weighted K-NN graph built from all sample representations; the selected samples serve as labeled data and the remaining samples as unlabeled data (Section 3.1). Second, we employ our robust semi-supervised learning framework to learn from the resulting semi-supervised dataset (Section 3.2). Appendix A delineates the full algorithm.
Specifically, we denote the original noisy dataset in this task as D = {(s_i, ỹ_i)}, where ỹ_i is the distantly supervised label of sentence s_i. The labeled dataset (identified confident samples) is denoted as X, and the unlabeled dataset as U.

Confident Samples Identification with Weighted K-NN
Our semi-supervised learning module requires us to divide the noisy dataset into a labeled dataset and an unlabeled dataset. Inspired by (Lee et al., 2019; Bahri et al., 2020), we utilize neighborhood information in the hidden feature space to identify confident samples. We employ supervised contrastive learning to warm up our model and obtain representations for all instances. It is noteworthy that deep neural networks tend to initially fit the training data with clean labels during an early training phase, prior to ultimately memorizing the examples with false labels (Arpit et al., 2017; Liu et al., 2020). Consequently, we warm up our model for only a single epoch. Given two sentences s_i and s_j, we obtain their low-dimensional representations z_i = θ(s_i) and z_j = θ(s_j), where θ is the sentence encoder, and calculate their similarity using the cosine distance. We then build a weighted K-NN graph over all samples based on the cosine distance. To quantify the agreement between s_i and ỹ_i, we first use the label distribution in the weighted K-neighborhood to approximate the clean posterior probabilities:

q̂_c(s_i) = ( Σ_{j∈N_i} w_ij · 1[ỹ_j = c] ) / ( Σ_{j∈N_i} w_ij ),   (2)

with w_ij the cosine similarity between z_i and z_j,
where N_i represents the set of K closest neighbors of s_i. We then use the cross-entropy loss ℓ to calculate the disagreement between q̂(s_i) and ỹ_i. Denoting the set of confident examples belonging to the c-th class as X_c, we have

X_c = { (s_i, ỹ_i) : ỹ_i = c, ℓ(q̂(s_i), ỹ_i) ≤ γ_c },   (3)

where γ_c is a threshold for the c-th class, dynamically defined to ensure a class-balanced set of identified confident examples. To achieve this, we use the α fractile of the per-class agreement between the original label ỹ_i and max_c q̂_c(s_i) across all classes to determine how many examples should be selected for each class.
Finally, we obtain the labeled set and unlabeled set as X = ∪_c X_c and U = D \ X. (4)
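The selection procedure can be sketched as follows. This is a simplified, hypothetical implementation: it uses an unweighted neighborhood vote instead of the cosine-weighted one, and the toy embeddings and hyper-parameters exist only for illustration.

```python
import numpy as np

def select_confident(Z, y, num_classes, K=3, alpha=0.8):
    """Sketch of confident-sample identification with a K-NN graph."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y)
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = Zn @ Zn.T                          # cosine similarities
    np.fill_diagonal(sim, -np.inf)           # exclude self from the neighborhood
    nn = np.argsort(-sim, axis=1)[:, :K]     # indices of the K nearest neighbors
    # Approximate the clean posterior q_hat by the neighborhood label distribution.
    q = np.zeros((len(y), num_classes))
    for i in range(len(y)):
        for j in nn[i]:
            q[i, y[j]] += 1.0 / K
    # Cross-entropy disagreement between q_hat and the noisy label.
    loss = -np.log(q[np.arange(len(y)), y] + 1e-12)
    confident = []
    for c in range(num_classes):
        idx = np.where(y == c)[0]
        if len(idx) == 0:
            continue
        gamma_c = np.quantile(loss[idx], alpha)   # per-class fractile threshold
        confident.extend(int(i) for i in idx[loss[idx] <= gamma_c])
    return sorted(confident)
```

On a toy set of two clusters where samples 2 and 5 carry flipped labels, the mislabeled points disagree with their neighborhoods and are excluded from the confident set.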

Noise-Robust Semi-Supervised Learning
Despite selecting confident samples from the distantly supervised dataset, a small amount of noise still remains in the labeled dataset. Naively training a classifier in the presence of label noise leads to noise memorization (Liu et al., 2020), which degrades performance. We propose a noise-robust semi-supervised learning framework to mitigate the influence of this remaining noise.

Data Augmentation with Dropout
Inspired by SimCSE (Gao et al., 2021b), we augment training samples at the embedding level.
In particular, we obtain different embeddings of a sentence by applying dropout during the forward pass, using a high dropout rate for strong augmentation and a low dropout rate for weak augmentation. The sentence encoder is denoted as θ; forward propagation with a high dropout rate is denoted θ_s and with a low dropout rate θ_w.
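The weak/strong distinction can be sketched with plain inverted dropout applied to an embedding vector. This is a hypothetical standalone function for illustration; in the actual model, dropout acts inside the encoder's forward pass rather than on a finished embedding.

```python
import random

def dropout_forward(x, rate, rng=random):
    # Inverted dropout: zero each unit with probability `rate`,
    # and rescale surviving units by 1/(1 - rate).
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in x]

emb = [0.5, -1.2, 0.8, 0.3]                       # made-up sentence embedding
rng = random.Random(0)
weak = dropout_forward(emb, rate=0.2, rng=rng)    # theta_w: weak augmentation
strong = dropout_forward(emb, rate=0.4, rng=rng)  # theta_s: strong augmentation
```

Two forward passes of the same sentence thus yield two stochastically different views, with the higher rate producing the stronger perturbation.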

Unsupervised Learning with Pseudo Labeling
In this part, we propose two modules to learn from the unlabeled dataset: (1) to generate and utilize pseudo labels independently, we propose a pseudo classifier head; (2) we utilize Curriculum Pseudo Labeling to perform consistency learning while combating the imbalance of the generated pseudo labels.
Pseudo classifier head: Pseudo labeling is one of the prevalent techniques in semi-supervised learning. Existing approaches generate and utilize pseudo labels with the same head. However, this may cause training bias, ultimately amplifying the model's errors as self-training continues (Wang et al., 2022). To reduce this bias, we propose a two-classifier model consisting of an encoder θ with both a classifier head ϕ and a pseudo classifier head ψ. We optimize the classifier head ϕ using only labeled samples, without any unreliable pseudo labels from unlabeled samples. Unlabeled samples are used solely to update the encoder θ and the pseudo classifier head ψ. In particular, the classifier head ϕ generates pseudo labels (ϕ ∘ θ_w)(u_b) for unlabeled samples (with no gradient), and the loss on unlabeled samples is calculated as ℓ((ψ ∘ θ_s)(u_b), (ϕ ∘ θ_w)(u_b)), where ℓ denotes the cross-entropy loss. This decouples the generation and utilization of pseudo labels into two parameter-independent heads to mitigate error accumulation.
Curriculum Pseudo Labeling: Due to the highly imbalanced dataset, using a constant cut-off τ for all classes in pseudo labeling results in almost all selected samples (those with confidence greater than the cut-off) being labeled as "NA", the dominant class.
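The decoupling of pseudo-label generation (ϕ on the weak view) from utilization (ψ on the strong view) might be sketched as below. The callable encoder/head interfaces are hypothetical simplifications of the actual networks.

```python
def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

def unlabeled_step(theta_w, theta_s, phi, psi, u_b):
    # Generate the pseudo label with the classifier head phi on the
    # weakly-augmented view (treated as detached: no gradient flows here).
    pseudo = argmax(phi(theta_w(u_b)))
    # Only the encoder and the pseudo head psi are trained against this
    # target: loss = cross_entropy(psi(theta_s(u_b)), pseudo).
    logits = psi(theta_s(u_b))
    return pseudo, logits
```

Because ϕ never receives gradients from unlabeled data, errors in the pseudo labels it produces do not feed back into the head that produced them.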
Inspired by FlexMatch (Zhang et al., 2021), we use Curriculum Pseudo Labeling (CPL) to combat imbalanced pseudo labels. We first generate pseudo labels for iteration t as

q_b = argmax p_b,  with p_b = (ϕ ∘ θ_w)(u_b).   (5)

These labels are then used as targets for the strongly-augmented data. The unsupervised loss term has the form

L_u,t = (1/μB) Σ_b 1[max p_b > τ_t(q_b)] · ℓ((ψ ∘ θ_s)(u_b), q_b),   (6)

where τ_t(c) = (σ_t(c) / max_{c'} σ_t(c')) · τ is the class-specific dynamic threshold and σ_t(c) represents the number of samples whose predictions fall into class c and above the fixed threshold τ, formulated as

σ_t(c) = Σ_b 1[max p_b > τ] · 1[q_b = c].   (7)
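The dynamic per-class thresholds can be computed roughly as follows. This is a FlexMatch-style sketch under the stated assumptions; the exact normalization in SSLRE may differ.

```python
def cpl_thresholds(pseudo_labels, confidences, num_classes, tau=0.95):
    # sigma_t(c): number of samples predicted as class c with confidence
    # above the fixed cut-off tau (Eq. 7 in the text).
    sigma = [0] * num_classes
    for c, p in zip(pseudo_labels, confidences):
        if p > tau:
            sigma[c] += 1
    denom = max(max(sigma), 1)
    # Curriculum scaling: classes with fewer confident predictions get a
    # lower class-specific threshold tau_t(c), so minority-class pseudo
    # labels are not drowned out by the dominant "NA" class.
    return [tau * s / denom for s in sigma]
```

With predictions skewed toward class 0, class 1 receives a proportionally lower cut-off, letting more of its pseudo-labeled samples into the unsupervised loss.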

Mixup Supervised Contrastive Learning
We aim to learn robust relation representations in the presence of the remaining label noise. In particular, we adopt a contrastive learning approach: we randomly sample N sentences and run inference on each sentence twice with the same dropout rate to obtain two views, then L2-normalize the embeddings. The resulting minibatch {(z_i, y_i)}_{i=1}^{2N} consists of 2N normalized sentence embeddings and their corresponding labels, on which we perform supervised contrastive learning. To make representation learning robust, we add Mixup (Berthelot et al., 2019) to supervised contrastive learning. Mixup strategies have demonstrated excellent performance in classification frameworks and have further shown promising results in preventing label noise memorization. Inspired by this success, we propose mixup supervised contrastive learning, a novel adaptation of mixup data augmentation for supervised contrastive learning.
Mixup performs convex combinations of pairs of samples as

x̃_i = λ x_a + (1 − λ) x_b,   (10)

where λ ∈ [0, 1] ∼ Beta(α_m, α_m) and x̃_i denotes the training example that combines the two mini-batch examples x_a and x_b. A linear relation is imposed on the contrastive loss:

L_i = λ L_a + (1 − λ) L_b,   (11)

where L_a and L_b have the same form as L_i in Eq. (9). The supervised loss is the sum of Eq. (11) over all mixed instances:

L_s = Σ_i L_i.   (12)

Mixup supervised contrastive learning helps to learn a robust representation for relations, but it cannot map the representation to a class. To learn the mapping from the learned representation to a relation class, classification learning with the cross-entropy loss L_ce over the labeled samples is also employed. (13)
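The convex combination and the linear loss relation can be sketched as follows. In practice λ would be drawn from Beta(α_m, α_m) (e.g. via `random.betavariate`); the fixed λ in the usage note is only for illustration.

```python
import random

def mixup_pair(x_a, x_b, lam):
    # Eq. (10): convex combination of two examples with coefficient lambda.
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]

def mixup_contrastive_loss(loss_a, loss_b, lam):
    # Eq. (11): the same linear relation imposed on the contrastive loss.
    return lam * loss_a + (1.0 - lam) * loss_b

alpha_m = 1.0
lam = random.betavariate(alpha_m, alpha_m)   # lambda ~ Beta(alpha_m, alpha_m)
mixed = mixup_pair([1.0, 1.0], [3.0, 3.0], lam)
```

With λ = 0.25, mixing [1, 1] and [3, 3] yields [2.5, 2.5], and losses of 4.0 and 8.0 combine to 7.0.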

Training Objective
Combining the above analyses, the total objective loss is L = L_s + λ_c · L_ce + λ_u · L_u,t, where λ_c and λ_u are the classifier and unsupervised loss weights.

Experiments

Evaluation and Parameter Settings
To guarantee the fairness of evaluation, we adopt both sentence-level and bag-level evaluation in our experiments. Further details of the evaluation methods are available in Appendix C. To achieve bag-level evaluation under sentence-level training, we use the at-least-one (ONE) aggregation strategy (Zeng et al., 2015), which first predicts relation scores for each sentence in the bag and then takes the highest score for each relation. Details of the hyper-parameters are available in Appendix D.
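The ONE aggregation strategy amounts to a per-relation max over the sentences in a bag; the score matrix below is made up for illustration.

```python
def bag_scores_one(sentence_scores):
    # sentence_scores: one relation-score vector per sentence in the bag.
    # At-least-one (ONE) aggregation: take, for each relation, the highest
    # score achieved by any sentence in the bag.
    return [max(col) for col in zip(*sentence_scores)]

bag = [[0.1, 0.7, 0.2],   # sentence 1's scores for 3 relations
       [0.6, 0.2, 0.2],   # sentence 2
       [0.3, 0.3, 0.4]]   # sentence 3
assert bag_scores_one(bag) == [0.6, 0.7, 0.4]
```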

Baselines
To demonstrate the effectiveness of SSLRE, we compare our method with state-of-the-art sentence-level and bag-level DSRE frameworks.
For bag-level baselines, RESIDE (Vashishth et al., 2018) exploits entity type and relation alias information to add a soft constraint for DSRE. DISTRE (Alt et al., 2019) adds selective attention to its Transformer-based model. CIL (Chen et al., 2021a) utilizes contrastive instance learning under the MIL framework; HiCLRE (Li et al., 2022) proposes a hierarchical contrastive learning framework; PARE (Rathore et al., 2022) proposes concatenating all sentences in the bag so that every token in the bag can be attended to. Besides, we combine BERT with different aggregation strategies: ONE, mentioned in Section 4.2; AVG, which averages the representations of all sentences in the bag; and ATT (Lin et al., 2016), which produces a bag-level representation as a weighted average over sentence embeddings, with weights determined by attention scores between sentences and relations.
For sentence-level baselines, RL-DSRE (Feng et al., 2018) applies reinforcement learning to train a sample selector with feedback from a manually designed reward function. ARNOR (Jia et al., 2019) selects reliable instances based on the reward of attention scores on selected patterns. SENT (Ma et al., 2021) filters noisy instances via negative training.

Results
We first evaluate our SSLRE framework on the NYT10m and WIKI20m datasets. Table 1 shows the overall performance in terms of sentence-level evaluation. From the results, we observe that: (1) Our SSLRE framework demonstrates superior performance on both datasets, surpassing all strong baseline models significantly in terms of F1 score. Compared to the strongest baselines on the two distantly supervised datasets, SSLRE achieves significant improvements (+6.3% F1 and +3.4% F1). (2) The current sentence-level DSRE models (SENT, ARNOR) fail to outperform the state-of-the-art MIL-based techniques in terms of F1 score on these datasets. This could be attributed to the loss of information resulting from the elimination of samples: unlike MIL-based methods that employ all samples for training, these models utilize only selected samples, and the selection procedure may not always be reliable. (3) The performance of state-of-the-art MIL-based methods is not substantially superior to that of the BERT baseline, suggesting that the MIL modules specifically crafted for this task are not significantly effective when evaluated at the sentence level.
Table 2 shows the results in terms of bag-level evaluation. Although not trained under the MIL framework, our SSLRE framework achieves state-of-the-art performance on bag-level relation extraction. This finding suggests that sentence-level training can also yield excellent results on bag-label prediction, an observation consistent with (Gao et al., 2021a; Amin et al., 2020). On the wiki20m dataset, we note a consistent improvement as well, although it is less evident than on NYT10m. We surmise that this could be attributed to wiki20m being relatively less noisy than NYT10m.
We also compared our framework to several strong baselines using held-out evaluation on the NYT10 dataset, which is detailed in appendix B.

Ablation Study
We conducted ablation experiments on the NYT10m dataset to assess the effectiveness of the different modules in our SSLRE framework, removing each of the claimed contributions one at a time. For the unsupervised learning part, we remove the pseudo classifier head and CPL, one at a time. For the supervised learning part, we switch from mixup supervised contrastive learning to supervised contrastive learning and to cross entropy as the objective. For the confident sample identification method, we alternate between random selection and NLI-based selection instead of our K-NN method. The NLI method performs zero-shot relation extraction through Natural Language Inference (NLI) (Sainz et al., 2021) and then identifies confident samples based on the level of agreement between the distant label and the NLI soft label. Table 3 shows the ablation study results. We conclude that: (1) The unsupervised learning part effectively utilizes the unlabeled samples. Removing the pseudo classifier head and CPL leads to decreases of 1.3% and 5.1% in micro-F1, respectively.

(2) When dealing with noisy labeled data in supervised learning, mixup supervised contrastive learning proves more robust than both supervised contrastive learning (-2.9%) and cross entropy (-4.2%). (3) Our K-NN-based confident sample identification method outperforms the random method by 6.5% and the NLI method by 3.3%, indicating that our K-NN method can effectively select confident samples.

Analysis on KNN
We evaluated the ability of the weighted k-nearest neighbors (kNN) method to select confident samples. To elaborate, we intentionally corrupted the labels of instances in the NYT10m test set with random probabilities of 20%, 40%, and 60%. Our objective was to assess whether our weighted kNN method could effectively identify the uncorrupted (confident) instances. We trained our model on the corrupted NYT10m test set for 10 epochs, considering its relatively smaller size compared to the training set, which required more epochs to converge. To evaluate the ability of the weighted kNN to identify confident samples, we report recall and precision; the results are shown in Table 4. It is worth noting that precision is the more important metric, because our goal is to make the labeled set as clean as possible.
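The reported precision and recall of the selection step correspond to the following standard computation over the selected set and the set of truly clean (uncorrupted) samples; the index sets below are hypothetical.

```python
def selection_metrics(selected, clean):
    # precision: fraction of selected samples that are truly clean;
    # recall:    fraction of clean samples that were selected.
    sel, cln = set(selected), set(clean)
    tp = len(sel & cln)
    return tp / len(sel), tp / len(cln)

# 4 samples selected, 3 of which are among the 5 truly clean ones.
assert selection_metrics([1, 2, 3, 4], [2, 3, 4, 5, 6]) == (0.75, 0.6)
```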

t-SNE analysis
To demonstrate that preserving unlabeled samples facilitates learning a superior representation compared to discarding them, we used the sentence representations obtained from the encoder θ as input to t-SNE dimension reduction to acquire two-dimensional representations. We focused on four primary relation classes: "/location/location/contains", "/business/person/company", "/location/administrative_division/country", and "/people/person/nationality". As depicted in Figure 4, leveraging unlabeled samples via pseudo labeling enhances the clustering of data points with identical relations and effectively separates distinct classes from one another.
Appendix F shows the t-SNE results of all classes.

Conclusion
In this paper, we propose SSLRE, a novel sentence-level framework grounded in semi-supervised learning for the DSRE task. Our SSLRE framework employs mixup supervised contrastive learning to tackle the noise present in selected samples, and it leverages unlabeled samples through pseudo labeling, which effectively utilizes the information contained within noisy samples.
Experimental results demonstrate that SSLRE outperforms strong baseline methods in both sentencelevel and bag-level relation extraction.

Limitations
In order to augment textual instances, we leverage dropout during forward propagation.This necessitates propagating each instance twice to generate the augmented sentence embeddings.However, the demand for GPU resources is higher compared to previous methods.Furthermore, we adjust the dropout rate to regulate the augmentation intensity for semi-supervised learning and show its effectiveness through the performance results.Nonetheless, we have not conducted explicit experiments to investigate the interpretability, which needs further investigation.

A Algorithm
Algorithm 1 provides the pseudo-code of the overall framework.
B Held-out evaluation

C Evaluation Settings
Sentence-level evaluation: Different from the bag-level evaluation used by MIL-based models, sentence-level (or instance-level) evaluation assesses model performance directly on all individual instances in the dataset, which requires the model to accurately predict the relation for each sentence. Following (Jia et al., 2019; Ma et al., 2021; Liu et al., 2022), we report micro-Precision (µPrec.), micro-Recall (µRec.), and micro-F1 (µF1) for sentence-level evaluation.

Bag-level evaluation: Bag-level evaluation assesses the performance of bag relation label extraction. Since manually annotated data are at the sentence level, following (Gao et al., 2021a), we construct bag-level annotations in the following way: for each bag, if one sentence in the bag has a human-labeled relation, the bag is labeled with this relation; if no sentence in the bag is annotated with any relation, the bag is labeled as "NA".

D Parameter Settings
The underlying encoder for sentences is implemented with BERT-base (Devlin et al., 2019), which generates a 768-dimensional context-aware representation for each token. During training, we set the learning rate to 2 × 10⁻⁵ and the batch size to 64, determined through a grid search over batch sizes in {16, 32, 64} and learning rates in {1e-5, 2e-5, 5e-5}. We train the model for 5 epochs and use Adam (Kingma and Ba, 2014) as the optimizer. The Mixup parameter α_m is set to 1, the classifier loss weight λ_c to 0.2, the fractile α to 0.8, the unsupervised loss weight λ_u to 1, and the CPL threshold τ to 0.95. We set the dropout rate for weak augmentation to 0.2 and for strong augmentation to 0.4. Further analysis of the strong-augmentation dropout rate is presented in Section 4.7.

E PR-curve
We report the P-R curve on NYT10m dataset as Figure 5:

F Additional t-SNE analysis
Figure 6 shows the t-SNE results on all classes of the sentence representation.
Algorithm 1: SSLRE Algorithm
Input: noisy dataset D
Output:
1 Warm up θ and ϕ for one epoch using supervised contrastive learning and get θ′.

Figure 1 :
Figure 1: Bag-level RE maps a bag of sentences to bag labels. Sentence-level RE maps each sentence to a specific relation.

Figure 2 :
Figure 2: An overview of the proposed framework, SSLRE. D, X, U denote the original noisy dataset, labeled dataset, and unlabeled dataset. θ indicates the encoder. θ_w and θ_s denote forward passes with lower and higher dropout rates, respectively. ϕ and ψ are the classifier head and pseudo classifier head. L_s is the mixup supervised contrastive loss defined in Eq. (12), and L_u,t is the unsupervised loss defined in Eq. (6).

Figure 3 :
Figure 3: Strong augmentation with different dropout rates

Figure 6
Figure 6: t-SNE visualization of representations with pseudo labeling (SSL) and without (Sup). SSL achieves better clustering results than Sup, especially for the green and light purple classes.

Table 1 :
Sentence-level evaluation results on NYT10m and wiki20m.Bold and underline indicate the best and the second best scores.

Table 2 :
Bag-level evaluation results on NYT10m and wiki20m. SSLRE-ONE represents SSLRE with the ONE aggregation strategy.

Table 3 :
Ablation study of SSLRE on NYT10m

Table 4 :
The effect of KNN.
Mengqi Zhang, Yuwei Xia, Qiang Liu, Shu Wu, and Liang Wang. 2023a. Learning latent relations for temporal knowledge graph reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12617-12631, Toronto, Canada. Association for Computational Linguistics.

Mengqi Zhang, Yuwei Xia, Qiang Liu, Shu Wu, and Liang Wang. 2023b. Learning long- and short-term representations for temporal knowledge graph reasoning. In Proceedings of the ACM Web Conference 2023, WWW '23, pages 2412-2422, New York, NY, USA. Association for Computing Machinery.

Table 5 :
Held-out evaluation on NYT10