Adaptive Hinge Balance Loss for Document-Level Relation Extraction



Introduction
Document-Level Relation Extraction (RE) plays an important role in NLP applications such as knowledge graph construction. It aims at predicting relations between entities across multiple sentences. As illustrated in Figure 1a, an entity pair may have zero, one, or multiple relations, so document-level RE is a multi-label classification task. A common practice is to adaptively select thresholds for multi-label classification (Zhou et al., 2021): for a correct prediction, the confidence scores of existent relations should be higher than the threshold, and conversely, those of non-existent relations should be lower.
However, there is a significant imbalance between positive and negative classes in document-level RE. The number of entity pairs increases quadratically with the number of entities. Thus, compared with the sentence-level counterpart, there are far more entity pairs to be classified in document-level RE, and most entity pairs express no relation. For example, in the document-level RE dataset Re-DocRED (Tan et al., 2022b), 94% of the entity pairs express no relation.
The issue of class imbalance may lead to more incorrect "no-relation" predictions. In this paper, we mainly consider the positive/negative imbalance, rather than the imbalance between different types of positive relations. The positive/negative imbalance tends to drag the threshold towards the larger class, that is, the "no-relation" class. We discover that 78.2% of incorrect "no-relation" predictions have correct label ranking, but the confidence scores of the existent relations fall below the threshold, as shown in Figure 1b. In other words, the model has enough confidence in the existent relations, but it makes a "no-relation" prediction due to the unnecessarily high threshold.
Based on this intuitive finding, we aim to address the phenomenon of "incorrect predictions with correct label ranking" to improve accuracy. We believe that overtraining on well-classified non-existent relations may lead to unnecessarily high thresholds. Therefore, we propose to adaptively select thresholds, and then down-weight the relations that are far from the decision boundary using Hinge Weighting. Our contributions are three-fold:
• We design a general pipeline termed Separate Adaptive Thresholding, which adaptively selects thresholds for multi-label classification.
• We propose a novel Adaptive Hinge Balance Loss, tackling the imbalance problem of positive and negative classes in document-level RE.
• Among all existing balancing methods, ours achieves the highest F1 score on the widely used Re-DocRED dataset.

Preliminary
The task of document-level relation extraction is concerned with predicting the relation types between subject and object entities in a given document. We first introduce the formulation of this task, and then discuss the commonly used Adaptive Thresholding Loss (ATL).

Problem Formulation
Given a document $D$ that contains a set of entities $\{e_i\}_{i=1}^{n}$, the task of document-level relation extraction is to predict the relation types between the entity pairs $(e_s, e_o)$, $s, o \in \{1, \dots, n\}$, $s \neq o$, where $e_s$ and $e_o$ represent the subject entity and the object entity, respectively. The set of relations is defined as $\mathcal{R} \cup \{\text{NA}\}$, where $\mathcal{R}$ is a set of pre-defined relations and NA stands for no relation between a pair of entities.
With the document $D$ and an entity pair $(e_s, e_o)$ contained in it, we obtain the representations of the subject and object entities through
$$z_s, z_o = \text{Rep}(D, e_s, e_o),$$
where $z_s$ and $z_o$ are the representations of the subject and object entity, and Rep is a representation module.
The score of relation $r$, denoted $s_r$, is computed from the subject and object entity representations with a bilinear classifier:
$$s_r = z_s^{\top} W_r z_o + b_r,$$
where $W_r \in \mathbb{R}^{d \times d}$ and $b_r \in \mathbb{R}$ are model parameters.
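As a concrete sketch of the bilinear scorer (our own illustration, not the authors' code; the `Rep` module is abstracted away and the dimensions are arbitrary):

```python
import numpy as np

def relation_scores(z_s, z_o, W, b):
    """Score every relation r via s_r = z_s^T W_r z_o + b_r.

    z_s, z_o: (d,) subject/object entity representations
    W: (num_rel, d, d) bilinear weights; b: (num_rel,) biases
    """
    return np.einsum("i,rij,j->r", z_s, W, z_o) + b

rng = np.random.default_rng(0)
d, num_rel = 4, 3
s = relation_scores(rng.normal(size=d), rng.normal(size=d),
                    rng.normal(size=(num_rel, d, d)), np.zeros(num_rel))
```

In practice the bilinear product is often computed in grouped or blocked form for efficiency, but the scalar score per relation is the quantity compared against the threshold below.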

Adaptive Thresholding Loss
Adaptive Thresholding Loss (ATL) (Zhou et al., 2021) is the most widely used loss function in transformer-based document-level relation extraction methods. It enables the model to learn multi-label classification thresholds, thereby achieving superior results compared to the global threshold of the BCE loss (Bengio et al., 2013).
In ATL, the labels of entity pair $T = (e_s, e_o)$ are divided into two subsets: positive classes $P_T$ and negative classes $N_T$, where $P_T \subseteq \mathcal{R}$ denotes the relations that exist between the entities of $T$, and $N_T \subseteq \mathcal{R}$ denotes the relations that do not. ATL introduces an additional threshold class TH. If an entity pair is correctly classified, the scores of $P_T$ should be higher than $s_{\text{TH}}$ while those of $N_T$ should be lower. ATL comprises two parts:
$$\mathcal{L}_1 = -\sum_{r \in P_T} \log \frac{\exp(s_r)}{\sum_{r' \in P_T \cup \{\text{TH}\}} \exp(s_{r'})},$$
$$\mathcal{L}_2 = -\log \frac{\exp(s_{\text{TH}})}{\sum_{r' \in N_T \cup \{\text{TH}\}} \exp(s_{r'})},$$
$$\mathcal{L}_{\text{ATL}} = \mathcal{L}_1 + \mathcal{L}_2.$$
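A minimal sketch of the two ATL terms for a single entity pair (our own illustration, assuming raw scores for the relations and the TH class):

```python
import numpy as np

def atl_loss(scores, s_th, positive):
    """ATL = L1 + L2 for one entity pair.

    scores: (num_rel,) relation scores; s_th: threshold-class score;
    positive: boolean mask marking the relations in P_T.
    """
    pos, neg = scores[positive], scores[~positive]
    # L1: each positive relation competes with P_T ∪ {TH}
    log_z1 = np.log(np.exp(pos).sum() + np.exp(s_th))
    l1 = -(pos - log_z1).sum()
    # L2: the TH class competes with N_T ∪ {TH}
    log_z2 = np.log(np.exp(neg).sum() + np.exp(s_th))
    l2 = -(s_th - log_z2)
    return l1 + l2

loss = atl_loss(np.array([2.0, -1.0, -3.0]), 0.0,
                np.array([True, False, False]))
```

At inference time, a relation is predicted for the pair exactly when its score exceeds the learned threshold score s_TH.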

An Empirical Analysis of ATL
A preliminary analysis is conducted to investigate the causes of classification errors in ATL, as shown in Table 1. All false predictions can be categorized into three patterns: FP, FN_CRK, and FN_IRK, which are illustrated in Figure 2. In particular, FN_CRK is the most dominant source of errors, accounting for 78.2% of all false negative predictions.
We notice that the number of relations in $N_T$ is significantly larger than that in $P_T$, and therefore $\mathcal{L}_2$ has a much greater impact on the overall loss than $\mathcal{L}_1$ (see Equations (3) and (4)). Due to the dominance of $\mathcal{L}_2$, the loss can be rewritten as
$$\mathcal{L} \approx \mathcal{L}_2 = \log\Big(1 + \sum_{r \in N_T} \exp(s_r - s_{\text{TH}})\Big).$$
This suggests that ATL learns a threshold $s_{\text{TH}}$ well above the candidate scores, which leads to an increase in the number of FN_CRK predictions.
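The rewriting of L2 into log-sum-exp form is an algebraic identity, which can be checked numerically; a minimal sketch with arbitrary scores:

```python
import numpy as np

rng = np.random.default_rng(0)
neg_scores = rng.normal(size=10)   # scores s_r for r in N_T
s_th = 0.5                         # threshold-class score s_TH

# L2 in its original softmax form
l2_softmax = -np.log(np.exp(s_th) / (np.exp(neg_scores).sum() + np.exp(s_th)))
# L2 in the rewritten log-sum-exp form
l2_rewritten = np.log1p(np.exp(neg_scores - s_th).sum())
```

Minimizing this term pushes s_TH above every negative score, which is exactly the mechanism that inflates the threshold when N_T is large.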

Adaptive Hinge Balance Loss
Based on the analysis above, we aim to maximize the distance between the decision boundary $s_{\text{TH}}$ and the sample points $s_r$, $r \in \mathcal{R}$, while simultaneously down-weighting the classes distant from the boundary. To this end, we propose our Adaptive Hinge Balance Loss.

Separate Adaptive Thresholding
An ideal loss should maximize the distance from the decision boundary to the sample point. Moreover, in the loss formulation, each relation class should be independent of the others to enable individual weighting of each class. Therefore, we propose Separate Adaptive Thresholding (SAT), formulated as
$$\mathcal{L}_{\text{SAT}} = -\sum_{r \in \mathcal{R}} \log \sigma(d_r), \qquad d_r = \begin{cases} s_r - s_{\text{TH}}, & r \in P_T, \\ s_{\text{TH}} - s_r, & r \in N_T, \end{cases}$$
where $\sigma$ is the sigmoid function, i.e. $\sigma(x) = \frac{1}{1 + e^{-x}}$. We define $s_{\text{TH}}$ as the decision boundary; $d_r$ is then the signed distance from the decision boundary to the score of relation $r$. The loss pushes $d_r$ to be as large as possible. The score of each relation is compared with the threshold separately, so we can assign different weights to different relations. With the hinge weights introduced below, positive relations whose scores are higher than $s_{\text{TH}} + m$ and negative relations whose scores are lower than $s_{\text{TH}} - m$ are not penalized.
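A sketch of SAT as reconstructed above (our own illustration): signed distances to the threshold, with one independent sigmoid term per relation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sat_loss(scores, s_th, positive):
    # d_r = s_r - s_TH for positive relations, s_TH - s_r for negative ones
    d = np.where(positive, scores - s_th, s_th - scores)
    # each relation is compared with the threshold separately
    return -np.log(sigmoid(d)).sum()

scores = np.array([3.0, -2.0, -4.0])
positive = np.array([True, False, False])
loss_good = sat_loss(scores, 0.0, positive)
loss_bad = sat_loss(scores, 5.0, positive)   # too-high threshold is penalized
```

Because each relation contributes its own term, per-relation weights can be attached without disturbing the other classes, which is what Hinge Weighting exploits.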

Hinge Weighting
To down-weight the easy, well-classified relations, i.e. the majority of negative relations, we propose Hinge Weighting, inspired by the hinge loss (Hearst et al., 1998). Our Hinge Weighting is shown in Figure 3. It is formulated as
$$w_r = \max(0, m - d_r),$$
where $m$ is a constant. When the distance $d_r$ is larger than $m$, the relation is not penalized; otherwise, it is penalized linearly with $d_r$. Essentially, Hinge Weighting implies that we should avoid focusing on relations with large $d_r$. $2m$ is the maximum margin between positive and negative classes, which is illustrated in Figure 4. Note that our weighting mechanism down-weights well-classified samples to zero. Since the majority of well-classified classes are negative, HingeABL achieves the effect of down-weighting the majority negative relations.

Loss Definition
Combining Separate Adaptive Thresholding and Hinge Weighting, we obtain the Adaptive Hinge Balance Loss (HingeABL):
$$\mathcal{L}_{\text{HingeABL}} = -\sum_{r \in \mathcal{R}} \frac{\max(0, m - d_r)}{\sum_{r' \in \mathcal{R}} \max(0, m - d_{r'})} \log \sigma(d_r),$$
where $\sigma$ is the sigmoid function and the hinge weight $w_r$ is given by Equation (9). The hinge weights are normalized over all relations. Our Adaptive Hinge Balance Loss simultaneously maximizes the distance between the decision boundary and the sample points and down-weights easy classes that are far from the boundary. This helps prevent over-fitting on well-classified relations.
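Putting the two pieces together, a minimal sketch of HingeABL for one entity pair (our own illustration of the reconstructed formula, with the margin m = 5 used in the experiments):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hinge_abl(scores, s_th, positive, m=5.0):
    d = np.where(positive, scores - s_th, s_th - scores)
    w = np.maximum(0.0, m - d)        # hinge weights: zero beyond the margin
    total = w.sum()
    if total == 0.0:                  # every relation is already "good enough"
        return 0.0
    return -(w / total * np.log(sigmoid(d))).sum()

scores = np.array([8.0, 2.0, -6.0, -1.0])
positive = np.array([True, True, False, False])
loss = hinge_abl(scores, 0.0, positive, m=5.0)
```

Relations beyond the margin (the first positive and the first negative above) receive zero weight, so the gradient concentrates on the relations near the decision boundary.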

Setup
We conduct experiments on Re-DocRED (Tan et al., 2022b), the largest well-labeled dataset for document-level RE. We use F1 and Ign_F1 as metrics; Ign_F1 is measured after removing from the dev/test sets the relations that also appear in the training set. More details about statistics and implementation are provided in Appendices A and B. Note that we report micro F1 to maintain consistency with previous methods; however, macro F1 better illustrates whether the proposed method performs well on minority classes. Results evaluated under macro F1 are provided in Appendix C.

Results
Different balancing methods. To compare different balancing methods, we use ATLOP (Zhou et al., 2021) as the representation module with BERT base (Devlin et al., 2019) as its encoder. We also compare our method with three other approaches: Balanced Softmax (Zhang et al., 2021), AML (Adaptive Margin Loss) (Wei and Li, 2022), and AFL (Adaptive Focal Loss) (Tan et al., 2022a).
To illustrate the effectiveness of Hinge Weighting, we implement an alternative weighted loss, MeanSAT, which weights the positive and negative classes of SAT by the inverse of their respective counts. Its formulation is given in Appendix D.
The results are shown in Table 3. HingeABL achieves the highest F1 and Ign_F1 (75.15 and 73.84) among all balancing methods. We observe a substantial increase in performance from applying either weighting method to SAT. Our experiments indicate that Hinge Weighting surpasses the constant weighting of MeanSAT, which demonstrates the superiority of HingeABL.
Besides, we compare the two margin-based loss functions, AML and HingeABL, through the mathematical analysis provided in Appendix E. We find that AML penalizes misclassified samples linearly with the distance, while HingeABL penalizes them nonlinearly; the nonlinear function is strictly convex, which benefits optimization.
Different document-level RE models. To test the generality of our approach, we select three commonly used transformer-based methods for document-level relation extraction and replace their loss functions with our Adaptive Hinge Balance Loss. Among the three original base methods, ATLOP employs the ATL loss, DocuNet employs the Balanced Softmax loss, and KD-DocRE employs the AFL loss; both Balanced Softmax and AFL are improvements of ATL. All methods use RoBERTa large (Zhuang et al., 2021) as their encoder. As Table 2 shows, all three methods achieve consistent performance gains with the use of HingeABL. This affirms the generalizability of our approach.
Note that HingeABL's improvement appears less significant when the base method is more powerful. This is natural: stronger base methods already employ better loss functions, so replacing them with HingeABL yields a smaller improvement.
Prediction statistics. To verify whether our model solves the problem of high thresholds, we count the prediction patterns from Figure 2 and present the results in Table 4. Our analysis reveals that the proportions of FN and FN_CRK decrease, indicating that the issue is alleviated. An example of prediction results before and after applying HingeABL is shown in Figures 1b and 1c. While one might assume that lowering the threshold would lead to more false positive predictions, we observe that the total proportion of false predictions actually decreases. This suggests that HingeABL achieves a good balance in its threshold selection.

Conclusion
We propose a novel Adaptive Hinge Balance Loss for document-level relation extraction to tackle the imbalance between positive and negative classes. Experimental results show that our approach outperforms existing methods. Since our loss is model-independent, it is potentially applicable to other multi-label classification scenarios.

Limitations
Compared with classifying an entity pair known to express some relation, accurately determining whether any relation exists between an entity pair is a more challenging task. Despite our attempts to improve accuracy through better thresholding, the results are still far from ideal.

A Re-DocRED Statistics
Re-DocRED is a more reliable benchmark for document-level relation extraction. It is a revised version of DocRED (Yao et al., 2019), whose annotations have been pointed out to be incomplete by recent works (Huang et al., 2022; Tan et al., 2022a).

B Implementation Details
All experiments are implemented based on Hugging Face's Transformers (Wolf et al., 2020). In the experiment comparing different balancing methods, we use BERT base (Devlin et al., 2019) as the encoder of ATLOP. In the experiment comparing different RE models, we use RoBERTa large (Zhuang et al., 2021) as the encoder of these models, for the sake of comparison on the benchmark. We set the margin of HingeABL to m = 5 in all experiments. We use mixed-precision training (Micikevicius et al., 2018) based on the PyTorch amp library. The models are optimized with AdamW (Loshchilov and Hutter, 2019) with a linear warmup (Goyal et al., 2017) over the first 6% of steps, followed by a linear decay to 0. The learning rate is 5e-5 for models with BERT as the encoder and 3e-5 for models with RoBERTa. The train batch size is 4 and the test batch size is 8. We train each model for 30 epochs. For each experiment, we run 5 different seeds (1, 5, 42, 66, 233) and report the average score. All models are trained on 1 Tesla A800 GPU.

C Comparison results under macro F1
The comparison among different balancing methods under macro F1 and macro Ign_F1 is shown in Table 7. Our proposed HingeABL still achieves the highest score under macro F1.

D Formulation of MeanSAT
MeanSAT weights the positive and negative classes of SAT by the inverse of their respective counts. It is formulated as
$$\mathcal{L}_{\text{MeanSAT}} = -\frac{1}{N_p} \sum_{r \in P_T} \log \sigma(d_r) - \frac{1}{N_n} \sum_{r \in N_T} \log \sigma(d_r),$$
where $N_p$ and $N_n$ are the numbers of positive and negative classes for the entity pair.

Table 7:
Comparison among balancing methods under macro F1 / macro Ign_F1: ATL (Zhou et al., 2021) 59.39 / 56.57; Balanced Softmax (Zhang et al., 2021) 60.67 / 57.89; AML (Wei and Li, 2022) 58.65 / 55.81; AFL (Tan et al., 2022a) 61.
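A minimal sketch of the MeanSAT baseline (our own illustration, reusing the signed distances d_r of SAT):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_sat(scores, s_th, positive):
    d = np.where(positive, scores - s_th, s_th - scores)
    n_p, n_n = positive.sum(), (~positive).sum()
    term = -np.log(sigmoid(d))
    # constant 1/N_p and 1/N_n weights instead of hinge weights
    return term[positive].sum() / n_p + term[~positive].sum() / n_n

loss = mean_sat(np.array([2.0, -1.0, -3.0, -4.0]), 0.0,
                np.array([True, False, False, False]))
```

Unlike Hinge Weighting, these weights are fixed per entity pair, so well-classified negatives still contribute to the loss; this is the contrast the ablation in Table 3 probes.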

E Mathematical analysis of Adaptive Margin Loss and HingeABL
In addition to experiments, we also compare the two margin-based losses, Adaptive Margin Loss (AML) and HingeABL, from a mathematical analysis perspective.
Analysis 1: For a sample that is not well classified, the Adaptive Margin Loss is a linear function with respect to the distance.
The Adaptive Margin Loss is defined as
$$\mathcal{L}_{\text{AML}} = \sum_{r \in \mathcal{R}} \max(0, m - d_r).$$
For class $r$, $L_r = \max(0, m - d_r)$.
For a well-classified sample, $d_r \geq m$ and $L_r = 0$. For a sample that is not well classified, $d_r < m$ and $L_r = m - d_r$.
In the second case, we denote $c_r = -d_r > -m$, which measures how far a sample that is not well classified lies from the decision boundary. Note that although we call $c_r$ a "distance", it is not necessarily greater than zero. The smaller $c_r$ is, the better the sample is classified, so a larger $c_r$ should receive a larger penalty. We then have
$$\frac{\partial L_r}{\partial c_r} = 1.$$
This means the Adaptive Margin Loss penalizes samples that are not well classified linearly with the distance. (Note: a sample that is not well classified satisfies $c_r = -d_r > -m$; a sample that is misclassified satisfies $c_r = -d_r > 0$.)
Analysis 2: For a sample that is not well classified, HingeABL is a strictly convex function with respect to the distance.
HingeABL is defined as
$$\mathcal{L}_{\text{HingeABL}} = -\sum_{r \in \mathcal{R}} \frac{\max(0, m - d_r)}{\sum_{r' \in \mathcal{R}} \max(0, m - d_{r'})} \log \sigma(d_r).$$
The denominator $\sum_{r' \in \mathcal{R}} \max(0, m - d_{r'})$ is a normalization factor, which we discard for ease of analysis.
For class $r$, if a sample is well classified, $d_r \geq m$ and $L_r = 0$. Otherwise, writing $c_r = -d_r$,
$$L_r = (m - d_r)\big({-\log \sigma(d_r)}\big) = (m + c_r)\,\log\big(1 + e^{c_r}\big),$$
whose second derivative is
$$\frac{\partial^2 L_r}{\partial c_r^2} = 2\sigma(c_r) + (m + c_r)\,\sigma(c_r)\,\sigma(-c_r) > 0 \quad \text{for } c_r > -m.$$
This means HingeABL penalizes samples that are not well classified nonlinearly with the distance, and the nonlinear function is strictly convex.
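The strict convexity can also be checked numerically with finite differences (our own sanity check, dropping the normalization factor as in the analysis):

```python
import numpy as np

m = 5.0
# region where the sample is "not well classified": c_r in (-m, +inf)
c = np.linspace(-m + 0.1, 5.0, 400)
# per-class HingeABL term as a function of c_r = -d_r (normalizer dropped)
L = (m + c) * np.log1p(np.exp(c))
# strict convexity => every second finite difference on a uniform grid is > 0
second_diff = L[2:] - 2.0 * L[1:-1] + L[:-2]
```

The same check applied to the AML term (m + c) gives second differences that are exactly zero, matching the linear-vs-convex contrast drawn above.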
Analysis 3: Comparison between the Adaptive Margin Loss and HingeABL.
1. Similarities. Both the Adaptive Margin Loss and HingeABL are margin-based loss functions. Neither punishes a prediction that is correct and "good enough" (rather than "perfect"), which is a form of regularization that prevents overfitting.
2. Differences. For wrong predictions, both give a penalty according to the distance $c_r$: the Adaptive Margin Loss gives a linear penalty, while HingeABL gives a strictly convex penalty. Compared to a linear penalty, a strictly convex penalty has two main advantages: (1) when $c_r$ is larger, HingeABL gives a larger penalty than the Adaptive Margin Loss; (2) the nature of strictly convex functions makes optimization more stable and more likely to converge to a globally optimal solution.

(Figure 1 sample) Subject: University of Toronto; Object: Canadian; Relations: country, located in. (a) A sample document in the Re-DocRED dataset. (c) Correct prediction after utilizing adaptive hinge balance loss.

Figure 1 :
Figure 1: Illustration of multi-label classification in document-level relation extraction. (a) Two relations exist between University of Toronto and Canadian. (b) The entity pair in (a) is incorrectly predicted as "no-relation": scores of the existent relations (country, located in) are lower than the threshold (11.16), but significantly higher than those of all non-existent relations. (c) After utilizing adaptive hinge balance loss, the threshold is reduced to an appropriate value.

Figure 2:
Figure 2: Three false prediction patterns. (a) FP (False Positive): the entity pair is recognized as related, but not all relations are accurately recognized. (b) FN_CRK (False Negative with Correct label RanKing): the entity pair is recognized as "no-relation", and all existent relations have higher confidence scores than non-existent relations. (c) FN_IRK (False Negative with Incorrect label RanKing): the entity pair is recognized as "no-relation", and not all existent relations have higher confidence scores than non-existent relations.
Figure 4:
Figure 4: The maximum margin 2m between positive and negative classes.

Table 1 :
Number of the three false prediction patterns of ATL on Re-DocRED. The three patterns are illustrated in Figure 2.

Table 2 :
Performance of adaptive hinge balance loss with different backbones.

Table 3 :
Comparison with other balancing methods.

Table 4 :
Statistics of prediction patterns for different loss functions.
Re-DocRED contains 96 relations. Each document has an average of 391 entity pairs, among which 94% contain no relation. The detailed statistics are shown in Table 5 and Table 6.

Table 5 :
Statistics on the whole set of Re-DocRED.

Table 6 :
Statistics on the train/dev/test splits of Re-DocRED.