Self-distilled Transitive Instance Weighting for Denoised Distantly Supervised Relation Extraction



Introduction
Distantly Supervised Relation Extraction (DSRE) (Mintz et al., 2009) is designed to automatically annotate sentences mentioning entity pairs, which enables a significant way of constructing large-scale datasets. However, distant supervision (DS) works under an unrealistic assumption that all sentences mentioning the same entity pair express the same relation. This introduces many noisy (wrongly labeled) instances into the dataset. To tackle this challenge, previous works mostly adopt the bag-level setting as shown at the top of Figure 1, where the vector representations of sentences are aggregated into a bag-level representation using multi-instance learning (MIL) (Riedel et al., 2010). The mainstream encoders of DSRE models over the years have been the Piecewise Convolutional Neural Network (PCNN) (Zeng et al., 2015) and the Recurrent Neural Network (RNN) (Zhou et al., 2016; Liu et al., 2018). It is reasonable for most previous works to treat such a simple encoder as a black box and only utilize its final output during training and inference. However, as large models like BERT (Devlin et al., 2019) have become popular in recent years, the information within the outputs of their intermediate layers is a non-trivial source of knowledge, yet it is rarely discussed in DSRE. In this work, we apply self-distillation to extract intermediate information in the form of output probabilities and utilize it to denoise wrong labels. Furthermore, we use soft target selection and set up transitive knowledge passing among the students to alleviate the effects of noisy target probabilities from the teacher.
The instances in DSRE can be roughly divided into easy, hard and noisy ones. Both easy and hard instances are correctly labeled, but the model learns from hard instances more slowly (Huang et al., 2021). Noisy instances have wrong labels and can be further divided into False Positives (FPs) and False Negatives (FNs). FPs are instances with the NA relation that are wrongly labeled as non-NA relations by DS, while FNs are non-NA instances wrongly labeled as NA. We hope to avoid learning from noisy instances since they contain misleading information. Moreover, we also need to avoid overfitting to easy instances to improve the learning of deeper knowledge.
To tackle the above challenges, we propose a novel Transitive Instance Weighting (TIW) mechanism for denoised sentence-level training in DSRE. Firstly, we apply self-distillation to directly reuse the knowledge of the teacher model for further denoising in the students and set up a transitive way to share knowledge among the students. Secondly, we leverage the TIW mechanism to generate robust instance weights to reduce noise and overfitting during distillation. TIW considers two factors in the generation of instance weights: Uncertainty (Liu et al., 2020b) for overfitting prevention and general Consistency for noise reduction. The general Consistency we propose reflects the learning difficulty of an instance and provides guidance both in selecting the soft targets of distillation and in weighting instances. Lastly, the generated instance weights directly multiply the sentence-level losses to dynamically and globally enhance training in the sentence-level setting. Experiments on both held-out and manual datasets show that our approach boosts the student's performance to achieve state-of-the-art results and consistent improvements over the teacher. We also provide an ablation study to explore the effects of the modules. In addition, we analyse the errors and provide additional experimental results in the Appendix.
Our contributions are summarized as follows: • We are the first to denoise sentence-level DSRE with dynamic instance weights and harness intermediate knowledge to improve noise resistance and information utilization.
• We propose a novel Transitive Instance Weighting mechanism with multiple functions, including noise alleviation, overfitting prevention, soft target selection and transitive knowledge passing.
• Experiments and analysis show that our method achieves state-of-the-art performance with good generalization and robustness.

Related Work
Distant supervision (DS) for relation extraction (Mintz et al., 2009) enables automatic annotation of large-scale datasets, but its strong assumption introduces a large number of wrongly labeled instances. Following Riedel et al. (2010), various multi-instance learning methods have been proposed to denoise noisy instances, and they broadly fall into two categories: instance selection (Zeng et al., 2015; Qin et al., 2018; Feng et al., 2018) and instance attention (Lin et al., 2016; Yuan et al., 2019b,a; Ye and Ling, 2019). Apart from multi-instance learning, many previous works try to improve the effectiveness of training. Liu et al. (2017) and Shang et al. (2020) try to convert wrongly labeled instances into useful information through relabeling. Huang and Du (2019) proposes collaborative curriculum learning for denoising. Hao et al. (2021) adopts adversarial training to filter noisy instances in the dataset. Nayak et al. (2021) designs a self-ensemble framework to filter noisy instances despite information loss. Li et al. (2022) proposes a hierarchical contrastive learning framework to reduce the effect of noise. Rathore et al. (2022) constructs a passage from the bags to generate a summary for classification. Nevertheless, the above approaches are trained with bag-level loss, leading to lower utilization of information. In our work, we adopt sentence-level training to directly utilize sentence-level information and effectively tackle noise and overfitting using dynamic instance weights. Knowledge distillation (Hinton et al., 2015) is an effective way to improve model generalization, though it has difficulty transferring knowledge effectively (Stanton et al., 2021). By sharing some parameters between teacher and students, self-distillation (Zhang et al., 2019a) improves knowledge transfer from teacher to student. Liu et al. (2020b) applies self-distillation on BERT (Devlin et al., 2019) to improve inference efficiency. However, in our work, we apply self-distillation as a tool to extract intermediate knowledge for denoising, and further reduce the noise from the teacher with transitive information passing among the students.
There are some epoch-level techniques to detect noisy instances, such as Swayamdipta et al. (2020) and Huang et al. (2021). But in sentence-level DSRE, which is highly noisy and contains bias from the entity mentions (Peng et al., 2020), larger models like BERT can overfit noisy instances faster, even before an epoch ends. Therefore, we adopt a dynamic instance weighting mechanism, which is more suitable for DSRE.

Methodology
3.1 Overview

DSRE aims to predict the relations between an entity pair given a bag of sentences mentioning them. Previous works mostly aggregate sentence representations into bag representations before optimizing with a bag-level loss. However, useful information may be diluted or mixed during aggregation. Instead, our model is trained directly at the sentence level to preserve more information.
Our model is illustrated in Figure 2. The backbone is the BERT encoder on the left, with a teacher classifier on the top. Each student shares a subencoder with the teacher and uses a new classifier for prediction. Firstly, the encoder and the teacher classifier are fine-tuned on the dataset to establish background knowledge. Then, we freeze the encoder and train the student classifiers, in which knowledge distillation and Transitive Instance Weighting (TIW) are applied to reduce noise and overfitting. TIW computes the instance weights based on three sources of knowledge: the teacher's output p_t, the output of student i itself p_s_i, and that of the previous peer p_s_(i-1). It first selects the more consistent soft target p_tg_i between p_t and p_s_(i-1) based on the probabilities of making the same predictions as them (i.e. Consistency), which are denoted as c_t_i and c_s_i respectively. Then the possible false negative instances are filtered according to the predictions p_s_(i-1) from the previous peer. Finally, the instance weights w_i are computed as the multiplication of the Uncertainty u_i (normalized entropy) and the general Consistency c_i with the soft target. The details of TIW are given in Algorithm 1, where re2id(r) is a function that maps the relation class r to its id for generating the one-hot label. The instance weights directly multiply the instance losses to dynamically regulate the roles of instances in optimization (Equation 6). This improves training by globally down-weighting the instances that lead to extra noise and overfitting.

Backbone
BERT (Devlin et al., 2019) is a powerful transformer-based pretrained network with broad applications in natural language processing. Its intermediate layers encode a rich hierarchy of sentence features, ranging from surface features and syntactic features to semantic features (Jawahar et al., 2019). However, previous BERT applications in DSRE (Alt et al., 2019; Rao et al., 2022) only utilize the output of the final layer, neglecting the possibility that hierarchical intermediate information can be useful in denoising. Therefore, we set up the student classifiers to extract information from the hierarchical features in the form of output probabilities and utilize them to distinguish noisy instances in the distillation stage.
The model takes a batch of sentences as input, each labeled with at least one relation. Firstly, each input sentence is transformed into a sequence of vector representations s by the embedding layer. Then, BERT conducts layer-wise feature extraction on the input s; the output of the i-th layer (1 ≤ i ≤ n) is described as:

h_i = BERT_i(s),    (1)

where BERT_i refers to the subencoder containing the transformer layers from the first to the i-th.
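As a toy illustration of collecting h_i for every sub-encoder depth, the layer-wise pass can be sketched as below, with simple functions standing in for transformer blocks (with a real HuggingFace BERT one would instead request output_hidden_states=True and read the returned hidden_states tuple):

```python
def layerwise_outputs(s, layers):
    """Collect h_i = BERT_i(s): the output after applying layers 1..i
    in sequence, one entry per layer, as in Eq. (1)."""
    hidden_states, h = [], s
    for layer in layers:
        h = layer(h)
        hidden_states.append(h)
    return hidden_states

# Toy 'layers' standing in for transformer blocks (illustrative only).
layers = [lambda h: [x + 1 for x in h],   # "layer 1"
          lambda h: [x * 2 for x in h]]   # "layer 2"
print(layerwise_outputs([0, 1], layers))  # [[1, 2], [2, 4]]
```

Each list entry plays the role of one h_i that a student classifier can consume.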

Algorithm 1 Transitive Instance Weighting
Input: DS label Y, the teacher's output probability p_t and the students' outputs p_s for the instance.
Output: The soft target p_tg and the instance weight w of the instance from the students.
1:  for each student i (l ≤ i ≤ n) do
2:     Compute the Consistency with teacher and peer:
3:     c_t_i ← p_s_i · p_t,  c_s_i ← p_s_i · p_s_(i−1)
4:     (p_tg_i, c_i) ← (p_t, c_t_i) if c_t_i ≥ c_s_i else (p_s_(i−1), c_s_i)
5:     if Y = NA then
6:        w_i ← 1 if argmax p_s_(i−1) = re2id(NA) else 0
7:     else
8:        Compute the Uncertainty of the soft target: u_i ← −Σ_r p_tg_i(r) log p_tg_i(r) / log n_c
9:        Compute the general Consistency and the instance weight: w_i ← u_i · c_i
10:    end if
11: end for

The encoder is fine-tuned with a simple feedforward classifier FFN_t on the top, and we obtain the output of the teacher p_t as follows:

p_t = softmax(FFN_t([h_(p_1) : h_(p_2)])) ∈ R^(n_c),    (2)

where p_1 and p_2 are the start positions of the head entity and the tail entity respectively, [a : b] indicates the concatenation of vectors a and b, and n_c is the number of classes. Similarly, the output of student i can be formulated as follows:

p_s_i = softmax(FFN_s_i([h_(i,p_1) : h_(i,p_2)])) ∈ R^(n_c).    (3)

Note that after fine-tuning, the parameters of the teacher model, including the BERT encoder, stay fixed during the process of self-distillation.
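A minimal pure-Python sketch of such a classifier head: concatenate the hidden states at the two entity start positions and apply a linear layer with softmax. The weight matrix W and bias b here are illustrative stand-ins for a learned feedforward classifier:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def classify(h, p1, p2, W, b):
    """Concatenate the hidden states at the head/tail entity start
    positions p1, p2 ([h_p1 : h_p2]) and apply a linear classifier
    plus softmax, in the spirit of Eqs. (2)/(3)."""
    x = h[p1] + h[p2]  # list concatenation plays the role of [a : b]
    logits = [sum(wi * xi for wi, xi in zip(row, x)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)

# Toy sequence of per-token hidden states (dimension 1) and a 2-class head.
h = [[1.0], [2.0], [3.0]]
p = classify(h, 0, 2, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
```

The output p is a distribution over the n_c relation classes.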

Transitive Instance Weighting
TIW incorporates multiple mechanisms to reduce noise and overfitting. For negative (NA) instances, TIW adopts False Negative Filtering (FNF) to filter false negatives based on the prediction of the previous peer p_s_(i-1). For positive (non-NA) instances, TIW provides dynamic instance weights w_i generated by multiplying the Uncertainty u_i and the general Consistency c_i. The Uncertainty u_i is computed as the normalized entropy of the student's soft target p_tg_i, as in Line 8 of Algorithm 1, and is applied to avoid overfitting to easy instances. The general Consistency c_i evaluates the consistency between the student's output and its soft target p_tg_i to limit the effects of wrongly labeled instances.
Most previous works in knowledge distillation directly use the teacher's output probability as the soft target. However, the teacher can constantly make mistakes if trained with noisy data, as in DSRE. Therefore, by introducing peer outputs p_s into distillation, TIW sets up a transitive way to share knowledge among the students and reduces the noise from the teacher. As in Line 4 of Algorithm 1, instead of blindly following the output from the teacher, each student i (i > l) chooses between the teacher p_t and the previous peer p_s_(i-1) to follow. This step is referred to as Soft Target Selection (STS) later. STS provides additional referential probability distributions for the learning students so they can switch to a smoother target probability when the output from the teacher is too hard to follow. The criterion of selection is Consistency (c_t_i and c_s_i), which is described as the probability of two systems making the same predictions and is computed as the dot product of the probability distributions of the two systems, as in Line 3 of Algorithm 1.
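The Consistency criterion and STS can be sketched in a few lines. The distributions below are illustrative, and breaking ties toward the teacher is an assumption of this sketch:

```python
def consistency(p, q):
    """Probability that two systems make the same prediction,
    approximated as the dot product of their output distributions
    (Line 3 of Algorithm 1)."""
    return sum(a * b for a, b in zip(p, q))

def select_soft_target(p_t, p_prev, p_student):
    """Soft Target Selection: the student follows whichever of the
    teacher (p_t) and the previous peer (p_prev) is more consistent
    with its own output, returning the target and its Consistency."""
    c_t = consistency(p_student, p_t)
    c_s = consistency(p_student, p_prev)
    return (p_t, c_t) if c_t >= c_s else (p_prev, c_s)
```

For example, a student with output [0.9, 0.1] facing a teacher output [0.1, 0.9] and a peer output [0.8, 0.2] would select the peer, since its Consistency with the peer (0.74) exceeds that with the teacher (0.18).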
In TIW, we adopt different strategies for negative instances and positive ones because their characteristics are quite different. For negative instances, we conduct FNF as in Lines 5-6 of Algorithm 1. Since we have sufficient negative instances in the dataset, it is acceptable to avoid more FNs at the cost of slight information loss. Therefore, we assign weight 0 to all the possible FNs and weight 1 to the rest. To correctly identify FNs, we adopt a dynamic approach: if the previous peer agrees with distant supervision and also labels the instance as NA, we classify the instance as a true negative; otherwise, we assume it is a false negative whose DS label is unreliable. The student follows the peer's view in FNF instead of the teacher's, because the teacher already overfits the noisy data and mostly follows the DS label, though the probabilities of the label relations may vary.
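The FNF rule reduces to a single comparison against the peer's prediction. The NA class id below is an illustrative assumption:

```python
NA = 0  # id of the NA (no-relation) class; illustrative choice

def fnf_weight(ds_label, peer_prediction):
    """False Negative Filtering: for an NA-labeled instance, keep it
    (weight 1) only when the previous peer also predicts NA; otherwise
    treat it as a likely false negative and drop it (weight 0)."""
    assert ds_label == NA, "FNF applies to negative (NA) instances only"
    return 1.0 if peer_prediction == NA else 0.0
```

So an NA instance that the peer labels with any non-NA relation contributes nothing to the student's loss.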
In order to preserve more information for training, we use soft weights for the positive instances instead of hard filtering. We call this Positive Weighting (PW) and determine the instance weight w_i of student i by two factors: the Uncertainty u_i and the general Consistency c_i with the selected soft target.
The Uncertainty term is the normalized entropy, as in Liu et al. (2020b), of the chosen soft target. It evaluates how well an instance is fitted, so we can leverage it to detect overfitted instances dynamically. Easy instances contain shallow features, like London, UK indicating a location/contains relation, so the model fits them easily and fast. But we do not want the model to become overdependent on them and lose focus on deeper features hidden in semantics. Therefore we discount their weights with Uncertainty to prevent overfitting.
The general Consistency c_i of student i is the Consistency between the student and the soft target. During distillation, each student is expected to stay consistent with its target distribution. If c_i is high, the student successfully follows the prediction of the soft target, indicating that the instance is easy for the student to learn. If c_i is low, the student fails to stay consistent with prior knowledge and the instance may be noisy or very hard to learn. The instance weight w_i should take the prevention of both noise and overfitting into consideration, so it is empirically implemented as the multiplication of the general Consistency c_i and the Uncertainty u_i, as in Line 9 of the algorithm.
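The two factors combine into the positive-instance weight as follows (a minimal sketch; the entropy is normalized by log n_c so that u_i lies in [0, 1]):

```python
import math

def normalized_entropy(p):
    """Uncertainty u_i: entropy of the soft target divided by its
    maximum possible value log(n_c), so the result lies in [0, 1].
    A uniform distribution gives 1; a one-hot distribution gives 0."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))

def instance_weight(u, c):
    """w_i = u_i * c_i (Line 9 of Algorithm 1): discounts both easy,
    already well-fitted instances (low u) and inconsistent,
    likely-noisy instances (low c)."""
    return u * c
```

A well-fitted easy instance (near one-hot soft target) gets a near-zero Uncertainty, while an inconsistent noisy instance gets a low Consistency; either way the product shrinks its weight.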
Note that during distillation, the student is trained with both soft targets and DS labels, as shown in Equation 6. We discuss the c_i and the losses of easy, noisy and hard instances in the following.
Easy instances mostly have high c_i and are well-fitted by the teacher or the peer, so the optimizations using soft targets and DS labels conform with each other.
Noisy instances are mostly underfitted and very hard to optimize because the soft targets and DS labels are mostly inconsistent. They have low c_i because the teacher and the students are not likely to provide consistent predictions.
Hard instances are underfitted clean instances with low c_i at first. However, their soft targets and DS labels are consistent, leading to steady optimization. When clean background knowledge is established by learning from clean instances, learning from hard ones becomes easier, so the c_i values of hard instances grow larger. Based on the above discussion, it is safe to say that both easy and hard instances are faster to fit than noisy ones during distillation, indicating that TIW is capable of reducing noise in the training set. As for Uncertainty, its role is non-decisive. Both hard and noisy instances tend to have high Uncertainty, but the hard ones have higher Consistency, leading to larger weights than noisy ones. Easy instances are fast to fit even with their weights discounted by low Uncertainty. Therefore, applying Uncertainty helps alleviate overfitting and does not lead to increases in noise.
To sum up, TIW aims to tackle noise and overfitting and thus can be combined with sentence-level training, which is more demanding in both noise reduction and overfitting alleviation than traditional bag-level training.

Optimization
The teacher and the peer may overfit noisy instances during fine-tuning and distillation. Therefore, we apply a dynamic temperature τ, controlled by a hyperparameter γ empirically set as 3, to the soft target.
The idea of τ is to further smooth the well-fitted instances to produce softer targets.
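A minimal sketch of the weighted objective this section builds toward: the instance weight w multiplies a mix of cross entropy with the DS label and KL divergence to a temperature-smoothed soft target. The dynamic τ schedule is not reproduced here; a fixed τ argument and the exact mixing form are assumptions of this sketch:

```python
import math

def weighted_kd_loss(p_student, p_target, y_onehot, w, alpha=0.5, tau=1.0):
    """Per-instance weighted distillation loss:
    w * (alpha * CE(p_student, Y) + (1 - alpha) * KL(q_tau || p_student)),
    where q_tau is the soft target smoothed with temperature tau."""
    # Temperature-smooth the soft target: softmax(log q / tau).
    logq = [math.log(max(q, 1e-12)) / tau for q in p_target]
    m = max(logq)
    e = [math.exp(v - m) for v in logq]
    q_tau = [v / sum(e) for v in e]
    # Cross entropy with the one-hot DS label.
    ce = -sum(y * math.log(max(p, 1e-12))
              for y, p in zip(y_onehot, p_student))
    # KL divergence from the smoothed target to the student output.
    kl = sum(q * (math.log(max(q, 1e-12)) - math.log(max(p, 1e-12)))
             for q, p in zip(q_tau, p_student))
    return w * (alpha * ce + (1 - alpha) * kl)
```

With w = 0 (e.g. a filtered false negative) the instance contributes nothing; larger tau flattens q_tau, softening the distillation signal.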
The loss function of our model follows the general form of knowledge distillation, with the instance weight w we propose:

L = w · (α · CE(p, Y) + (1 − α) · KL_τ(p, p_tg)),    (6)

where α is a hyperparameter empirically set as 0.5, KL_τ(p, q) computes the KL divergence between distributions p and q with temperature τ applied to the soft target q, Y is the label from distant supervision, and CE(p, Y) is the cross-entropy loss with the one-hot label obtained from Y.

Experiments

In this section, we demonstrate the overall performance of our model compared with previous baselines and the teacher model. We also conduct an ablation study to enable a deeper understanding of the mechanisms.

Datasets and Settings
We use the widely used held-out dataset NYT-10 (Riedel et al., 2010) and the recent manual dataset NYT-10m (Gao et al., 2021) for evaluation. As a standard dataset for DSRE, NYT-10 is constructed by aligning the relations in Freebase (Bollacker et al., 2008) with the New York Times (NYT) corpus (English). NYT-10m is a manual dataset constructed also from the NYT corpus, with a human-labeled test set and a new relation ontology. For NYT-10, we divide the dataset into five parts for cross-validation. For NYT-10m, we use the provided validation set. The details of the datasets are shown in Table 1.

Dataset     Train Sen. (k)  Train Fac. (k)  Test Sen. (k)  Test Fac. (k)  Rel.
held-out    522.6           18.4            172.4          2.0            53
manual      417.9           17.1            9.7            3.9            25

In the experiments, we use the bert-base-uncased checkpoint with about 110M parameters for initialization, as in Han et al. (2019). We apply the AdamW (Loshchilov and Hutter, 2017) optimizer during distillation. The structure of the embedding layer and the BERT layers follows the previous works, with the number of transformer layers n = 12 and hidden size d_h = 768. The batch size is 32 and the learning rate is 2e−5. The maximum sentence length m is 128. As discussed by Jawahar et al. (2019), the shallow layers may not be able to encode the information needed for the DSRE task. Therefore, TIW starts from layer l, which is empirically set as 7.
We compare the Area Under the precision-recall Curve (AUC), micro-F1, macro-F1, precision at the top N predictions (P@N, N = 100, 200, 300) and the mean of P@N, which is denoted as P@M. Following the at-least-one assumption (Riedel et al., 2010), we adopt the ONE strategy (Zeng et al., 2015) for bag-level evaluation, which takes the maximum score for each relation to generate bag-level predictions. We use the output of the last student (the 12th) as the output of the model.
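The ONE strategy amounts to a per-relation max over the sentence scores in a bag, as in this minimal sketch:

```python
def bag_scores_one(sentence_scores):
    """ONE strategy: aggregate sentence-level scores within a bag by
    taking, for each relation, the maximum score over the bag's
    sentences (at-least-one assumption)."""
    n_rel = len(sentence_scores[0])
    return [max(scores[r] for scores in sentence_scores)
            for r in range(n_rel)]

# A bag of two sentences, each scored over two relations.
print(bag_scores_one([[0.1, 0.9], [0.6, 0.2]]))  # [0.6, 0.9]
```

The resulting bag-level scores are then used for the PR curve and P@N computations.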
In the Appendix, we display the results from the other students and the results using other l settings. We also provide a detailed error analysis and extra experimental results on the Wiki-20m dataset. Moreover, we try a different initialization of the teacher to further explore the generalization of TIW.

Overall Performance
We compare the performance of our model against that of the following baselines: PCNN+ATT (Lin et al., 2016) proposes PCNN with a selective attention mechanism.
Among the baselines, DISTRE, HiCLRE, CIL and PARE use pretrained language models for initialization, and the last three use the same BERT pretrained encoder as ours. The held-out dataset is the mainstream for DSRE evaluation, but it contains wrongly labeled test instances leading to inaccurate evaluation. The manual dataset provides an accurate test set but is limited by its scale in generalization. Therefore, we use both datasets for better evaluation.

Evaluation on Held-out Dataset
Table 2 shows the experimental results on the held-out dataset. We use the results reported in the papers of the previous works. We also plot the precision-recall curves in Figure 3.
As shown in the results, our model achieves the best AUC and Micro-F1 score among all the compared methods. Direct sentence-level training (the teacher) leads to a slight decline in precision due to the existence of noise, but still achieves competitive AUC and Micro-F1 on the test set because of its advantage in information utilization. The P@N of the student is relatively lower than that of the bag-level baselines, but still improved over the teacher. Compared with the teacher, the student further alleviates noise and overfitting with TIW, thus achieving state-of-the-art performance.

Evaluation on Manual Dataset
Table 3 shows the experimental results on the manual dataset. We use the original implementations of the baselines to reproduce the results. The precision-recall curves are plotted in Figure 4. On the manual dataset, the bag-level methods still perform better at P@N; however, our method outperforms them in AUC and Micro-F1 by large margins. This shows that previous bag-level methods may overfit easy instances, losing overall generalization despite higher precision on easy instances (P@N). Also, some of the baselines fail to handle infrequent relation classes (especially Intra+inter), but our model manages to achieve both high Micro-F1 and Macro-F1. Moreover, the student improves significantly over the teacher, especially in P@N. These results further demonstrate the effectiveness of TIW in improving sentence-level training.
According to Gao et al. (2021), the performance of a model may be inconsistent when evaluated on both the held-out and manual datasets. Good performance on the held-out set may indicate overfitting to the bias from DS. However, our model is robust enough to perform well on both datasets.

Ablation Study
The ablation study is performed on the held-out dataset. As shown in Table 4, all the modules improve the overall performance. Detailed discussions are given below.
a: removes Uncertainty and directly uses the general Consistency as the positive weight. In this case, easy instances always have the largest weights even if they are already well-fitted. The model thus overfits shallow features, which is indicated by the high P@N and the decline in overall performance.
b: removes STS and follows the output probabilities from the teacher all the time. In this case, the noise from the teacher is not addressed. Fixing the soft target also leads to a fixed Uncertainty for each instance, causing the underfitting of some easy instances. Therefore, the performance declines, especially P@M.
c: removes PW, so all the positive instances are treated equally, including the noisy ones. Therefore, the model is heavily affected by noise and FNF may be inaccurate, leading to further performance declines. In this case, the high P@M indicates that the model overfits easy instances and loses generalization.
d: removes FNF. The FNs only make up a small part of the dataset, so the effects are relatively small. However, the noise from FNs significantly reduces P@M. We suspect that the fitting of FNs affects that of the true positives. If a false negative fn has syntactic and semantic features similar to a true positive tp, fitting fn is similar to fitting tp with an incorrect label.
e: removes TIW entirely, so all the instances are weighted as 1. The label smoothness of knowledge distillation is able to alleviate some noise from DS, so there are improvements in performance over f. However, the student is still trained with much noise and overfits easy instances, so the overall performance declines significantly.
f: is the probing result of the 12th layer using the DS label. It shows that without effective denoising mechanisms, simply retraining the classifier does not improve performance.
g: we have also tried the Jensen-Shannon divergence (Fuglede and Topsoe, 2004), a symmetrized variant of KL divergence, as the Consistency measure. The performance is not as good as the dot product because it pushes the student to output the same probability as the teacher. However, it should be acceptable (or even better) for the student to output a higher probability for the target relation than the teacher.
The above results and discussions further demonstrate the effectiveness of TIW designs in alleviating noise and overfitting.

Conclusions and Limitations
In this paper, we propose a novel Transitive Instance Weighting mechanism integrated with self-distillation to denoise the sentence-level training of DSRE. We employ the self-distilled BERT backbone to extract intermediate information for generating reliable instance weights. TIW combines Consistency with Uncertainty as the tools to tackle noisy instances and alleviate overfitting. It also enables soft target selection and transitive knowledge passing among the students to tackle the noise from the teacher. The experimental results show that our method improves the general resistance to DS noise and prevents overfitting from harming its generalization, and thus achieves state-of-the-art performance and consistent improvements over the baselines on both the held-out and manual datasets.
However, our work still has some limitations. Firstly, since our model is built on the basis of the teacher-student network, the performance of the student is highly affected by the teacher. If the teacher provides too much noisy information, our instance weighting mechanism might not work. Secondly, in some cases, the student fails to follow the correct predictions from the teacher, possibly due to ambiguity, lack of information or word-level noise. Finally, TIW may down-weight some instances of infrequent relation classes due to their difficulty, but this can be mitigated by combining TIW with other methods addressing the long-tailed distribution of relations.

A Hyperparameter Analysis
There are two key hyperparameters in our experiments: the student selected and the head layer l. In our best model, we select the last student (the 12th) for evaluation and set layer 7 as the head layer. As shown in Figure 5, the higher students (≥ 9) improve significantly over the teacher. The last student performs the best, and the students from the 9th to the 11th also achieve comparable performance. Lower layers of BERT encode shallower features and the instance weighting in lower students is more affected by noise, so the performances of the 7th and 8th students show little advantage over the teacher. With knowledge passed and noise alleviated student by student, the performance gradually improves. To study the effect of the head layer l, we run experiments with l from 1 to n. In Table 5, we present the results, where l = 7 achieves the best performance. For l > 7, the head layer is too close to the top and TIW filters fewer false negatives, so the P@M declines quickly, which is similar to the effect of removing FNF as in Table 4. For l < 7, the lower layers of BERT are not able to encode sufficient information for accurate relation extraction, so the lower students cannot provide reliable instance weights, leading to the transfer of some noise among the students. Though other settings are less effective than the best, their performance still dominates most of the baselines. The above results show that our method is not dependent on the empirical settings of hyperparameters and further demonstrate its effectiveness and robustness.

B Evaluations on Wiki-20m
In order to further explore the generalization of our method, we also experiment on the Wiki-20m dataset (Gao et al., 2021). The details of the dataset are shown in Table 6 and the results are shown in Table 7. We take the best reported results of the baselines from Rathore et al. (2022) and Song et al. (2023). On the Wiki-20m dataset, our model still achieves state-of-the-art performance and the improvements over the teacher are significant. Therefore, our method generalizes well to the Wiki-20m dataset, which has more relation classes (81).

C Error Analysis
For an accurate analysis of the errors, we use the test set of the manual dataset for statistical discussion. Each positive label is considered an item; instances with multiple positive labels are considered to have multiple items. We classify the items based on the predictions of the teacher and the student, then count the number and percentage of each class as in Table 8. The goal is to explore where the errors of the student come from: a) from the teacher, meaning that the knowledge from the teacher is noisy and leads to the student's errors, or b) from the student itself, meaning that the teacher gives correct knowledge but the student fails to follow it. In the results, the student achieves slightly higher (about 2%) accuracy than the teacher and shows high fidelity, with 97.1% of all predictions being the same as the teacher's. BI represents the student's errors caused by the errors from the teacher, TISC indicates the student's corrections of the teacher's errors, and TCSI represents the errors made by the student itself. From the results, we can conclude that almost all (about 97.5%) of the errors come from the teacher, and the corrections made by the student far outnumber the errors made by the student itself. This demonstrates the effectiveness of our method in reducing the occurrence of errors, as well as the limitation that it requires a good teacher for good performance.
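The BC/BI/TISC/TCSI classification can be sketched as a simple tally over (gold, teacher prediction, student prediction) items. The item format is an assumption for illustration:

```python
def tally_error_classes(items):
    """Classify each (gold, teacher_pred, student_pred) item into
    BC (both correct), BI (both incorrect), TISC (teacher incorrect,
    student correct) or TCSI (teacher correct, student incorrect),
    and return the counts as in Table 8."""
    counts = {"BC": 0, "BI": 0, "TISC": 0, "TCSI": 0}
    for gold, t, s in items:
        if t == gold and s == gold:
            counts["BC"] += 1
        elif t != gold and s != gold:
            counts["BI"] += 1
        elif t != gold and s == gold:
            counts["TISC"] += 1
        else:
            counts["TCSI"] += 1
    return counts
```

The BI count then measures errors inherited from the teacher, while TCSI isolates the errors introduced by the student itself.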

For further analysis of the student's errors, we inspect the TCSI items and select some representative ones for discussion, as in Figure 6. Most of the instances with the place_of_birth relation are correctly classified, and the first example should be an easy instance in form, yet it is misclassified by the student as place_lived. We observe several similar items and suspect that long and uncommon names like Carl Friedrich von Weizsäcker sometimes confuse the student into making conservative predictions, i.e. the more common relation place_lived. The second example, however, confuses the student with the compound noun Brooklyn College. Brooklyn appears very often in the dataset as a location, making the student believe that Brooklyn College is a location rather than an organization. The third example is mostly related to ambiguity, where the word Arab may refer to the Arab people (ethnic group) or the Arab world (location). The latter two examples indicate that the lack of entity-related information may lead to inconsistency between the student and the teacher.

Table 9: The performance (%) of the models on the manual dataset. † indicates initialization with the well-trained encoder.
The first example shows that the student may be confused and lose focus on key phrases like was born in, which may be addressed by combining TIW with word-level attention in the future.

D Effects of Teacher Model
In the main experiments of this paper, the teacher model is trained in an extremely noisy environment, leaving much room for TIW to improve performance. In order to explore the potential of TIW in improving state-of-the-art methods, we initialize the teacher model using the well-trained encoder from CIL (Chen et al., 2021) instead of the bert-base-uncased checkpoint. We repeat the experiments on the manual dataset and the results are shown in Table 9. In the results, the models initialized with the well-trained CIL encoder achieve significantly higher precision, and TIW further improves the performance over the baselines. However, since CIL is trained in a bag-level setting, it has lower information utilization than the models trained in sentence-level settings, leading to some decline in AUC and Micro-F1. These results show that the initialization of the teacher model has a great impact on the performance and that TIW can consistently improve the performance of the model with different teachers. Hopefully, TIW can be employed with more powerful teacher models to achieve even better performance in DSRE.

Figure 1: The bag-level and sentence-level pipelines of DSRE.

Figure 2: The overall framework of our model.Dotted arrows indicate the generation of instance weight.

Figure 3: PR curves of the models on the held-out dataset.

Figure 4: PR curves of the models on the manual dataset.

Figure 5: Results of the students and auxiliary classifiers of the teacher on the held-out dataset.

Table 1: The statistics of the datasets. Sen., Fac. and Rel. indicate the numbers of sentences, relation facts and relation types (including NA) respectively.

Table 2: The performance (%) of the models on the held-out dataset. The best scores are marked in bold and the second best scores are underlined, as in the other tables of the experiments.


Table 3: The performance (%) of our model and the baselines on the manual dataset.

Table 4: Ablation study of our method.

Table 5: Results of using different head layer l settings. The best results are marked in bold.

Table 6: The statistics of the Wiki-20m dataset. Sen., Fac. and Rel. indicate the numbers of sentences, relation facts and relation types (including NA) respectively.

Table 7: The performance (%) of our model and the baselines on the Wiki-20m dataset.

Table 8: Numbers and percentages of the different classes of items. BC stands for both correct, BI for both incorrect, TISC for teacher incorrect, student correct, and TCSI for teacher correct, student incorrect.