CIL: Contrastive Instance Learning Framework for Distantly Supervised Relation Extraction

Reducing the noise in training data generated by distant supervision (DS) has been a goal ever since DS was first introduced into the relation extraction (RE) task. For the past decade, researchers have applied the multi-instance learning (MIL) framework to find the most reliable feature from a bag of sentences. Although the pattern of MIL bags can greatly reduce DS noise, it fails to represent many other useful sentence features in the datasets. In many cases, these sentence features can only be acquired through costly sentence-level human annotation. Therefore, the performance of distantly supervised RE models is bounded. In this paper, we go beyond the typical MIL framework and propose a novel contrastive instance learning (CIL) framework. Specifically, we regard the initial MIL as the relational triple encoder and contrast positive pairs against negative pairs for each instance. Experiments demonstrate the effectiveness of our proposed framework, with significant improvements over previous methods on NYT10, GDS and KBP.


Introduction
Relation extraction (RE) aims at predicting the relation between entities based on their context. Several studies have been carried out to handle this crucial and complicated task over decades, as the extracted information can play a significant role in many downstream tasks. Since the amount of training data generally limits traditional supervised RE systems, current RE systems usually resort to distant supervision (DS) to fetch abundant training data by aligning knowledge bases (KBs) and texts. However, such a heuristic inevitably introduces some noise into the generated data. Training a robust and unbiased RE system under DS data noise becomes the biggest challenge for distantly supervised relation extraction (DSRE).
Aware of the existing DS noise, Zeng et al. (2015) introduced the multi-instance learning (MIL) framework to DSRE by dividing training instances into several bags and using bags as new data units. Regarding the strategy for selecting instances inside the bag, the soft attention mechanism proposed by Lin et al. (2016) is widely used because it performs better than the hard selection method. Its ability to form accurate representations from noisy data soon made the MIL framework the paradigm for follow-up work.
However, we argue that although the MIL framework is effective at alleviating data noise for DSRE, it is not data-efficient. As Figure 1 shows, the attention mechanism in MIL can help select relatively informative instances (e.g. $h_1$, $h_2$) inside the bag, but may ignore the potential information of the other abundant instances (e.g. $h_m$). In other words, no matter how many instances a bag contains, only the formed bag-level representation can be used for further training in MIL, which is quite inefficient. Thus, our focus is on how to make the initial MIL framework efficient enough to leverage all instances while maintaining the ability to obtain an accurate model under DS data noise.
Here, we propose a contrastive-based method to help the MIL framework learn efficiently. In detail, we regard the initial MIL framework as the bag encoder, which provides relatively accurate representations for different relational triples. Then we develop contrastive instance learning (CIL) to utilize each instance in an unsupervised manner. In short, the goal of our CIL is that instances sharing the same relational triple (i.e. positive pairs) ought to be close in the semantic space, while the representations of instances with different relational triples (i.e. negative pairs) should be far apart.
Experiments on three public DSRE benchmarks, NYT10 (Riedel et al., 2010; Hoffmann et al., 2011), GDS (Jat et al., 2018) and KBP (Ling and Weld, 2012), demonstrate the effectiveness of our proposed framework CIL, with consistent improvements over several baseline models and results that far exceed the state-of-the-art (SOTA) systems. Furthermore, the ablation study shows the rationality of our proposed positive/negative pair construction strategy.
Accordingly, the major contributions of this paper are summarized as follows:
• We discuss the long-standing MIL framework and point out that it cannot effectively utilize the abundant instances inside MIL bags.
• We propose a novel contrastive instance learning method to boost the performance of DSRE models under the MIL framework.
• Evaluation on held-out and human-annotated sets shows that CIL leads to significant improvements over the previous SOTA models.

Methodology
In this paper, we argue that the MIL framework is effective at denoising but is not efficient enough, as the initial MIL framework only leverages the formed bag-level representations to train models and sacrifices the potential information of the numerous instances inside bags. Here, we go beyond the typical MIL framework and develop a novel contrastive instance learning framework to solve this issue, which prompts DSRE models to utilize each instance. A formal description of our proposed CIL framework is given below.

Input Embeddings
Token Embedding For an input sentence/instance $x$, we utilize the BERT tokenizer to split it into several tokens $(t_1, t_2, \ldots, e_1, \ldots, e_2, \ldots, t_L)$, where $e_1$, $e_2$ are the tokens corresponding to the two entities, and $L$ is the max length of all input sequences. Following standard practice (Devlin et al., 2019), we add two special tokens to mark the beginning ([CLS]) and the end ([SEP]) of sentences.
In BERT, the token [CLS] typically acts as a pooling token representing the whole sequence for downstream tasks. However, this pooling representation treats the entity tokens $e_1$ and $e_2$ as equivalent to other common word tokens $t_i$, which has been shown (Baldini Soares et al., 2019) to be unsuitable for RE tasks. To encode the sentence in an entity-aware manner, we add four extra special tokens.
Position Embedding In the Transformer attention mechanism (Vaswani et al., 2017), positional encodings are injected to make use of the order of the sequence. Precisely, the learned position embedding has the same dimension as the token embedding so that the two can be summed.

Sentence Encoder
BERT Encoder (Transformer blocks, see Figure 2) transforms the above embedding inputs (token embedding and position embedding) into hidden feature vectors $(h_1, h_2, \ldots, h_{e_1}, \ldots, h_{e_2}, \ldots, h_L)$, where $h_{e_1}$ and $h_{e_2}$ are the feature vectors corresponding to the entities $e_1$ and $e_2$. By concatenating the two entity hidden vectors, we obtain the entity-aware sentence representation $h = [h_{e_1}; h_{e_2}]$ for the input sequence $x$. We denote the sentence encoder $H$ as:
$$H(x) = [h_{e_1}; h_{e_2}] = h$$
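The concatenation step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `entity_aware_representation`, the toy hidden states, and the assumption that each entity is represented by a single token position are all hypothetical.

```python
import numpy as np

def entity_aware_representation(hidden, e1_index, e2_index):
    """Concatenate the hidden vectors at the two entity positions: h = [h_e1; h_e2]."""
    hidden = np.asarray(hidden, dtype=float)
    return np.concatenate([hidden[e1_index], hidden[e2_index]])

# Toy encoder output: L = 5 tokens, hidden size 3 (values are made up).
hidden = np.arange(15, dtype=float).reshape(5, 3)

# Assumed entity positions; in practice they come from the tokenizer alignment.
h = entity_aware_representation(hidden, e1_index=1, e2_index=3)
print(h.shape)
```

The resulting vector has twice the hidden size, which is the dimension the bag encoder then operates on.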

Bag Encoder
Under the MIL framework, a set of instances $x$ with the same relational triple $[e_1, e_2, r]$ forms a bag $B$. We aim to design a bag encoder $F$ to obtain a representation $\tilde{B}$ for bag $B$; the obtained bag representation also serves as a representation of the current relational triple $[e_1, e_2, r]$, and is defined as:
$$\tilde{B} = F(B)$$
With the help of the sentence encoder described in section 2.2, each instance $x_i$ in bag $B$ is first encoded to its entity-aware sentence representation $h_i = H(x_i)$. The bag representation $\tilde{B}$ can then be regarded as an aggregation of all instances' representations, which is further defined as:
$$\tilde{B} = \sum_{i=1}^{K} \alpha_i h_i$$
where $K$ is the bag size. As for the choice of weight $\alpha_i$, we follow the soft attention mechanism used in (Lin et al., 2016), where $\alpha_i$ is the normalized attention score calculated by a query-based function $f_i$ that measures how well the sentence representation $h_i$ and the predicted relation $r$ match:
$$\alpha_i = \frac{e^{f_i}}{\sum_{j=1}^{K} e^{f_j}}, \qquad f_i = h_i A q_r$$
where $A$ is a weighted diagonal matrix and $q_r$ is the query vector which indicates the representation of relation $r$ (randomly initialized).
Then, to train such a bag encoder parameterized by $\theta$, a simple fully-connected layer with a softmax activation is added to map the hidden feature vector $\tilde{B}$ to a conditional probability distribution $p(r \mid \tilde{B}, \theta)$, which can be defined as:
$$p(r \mid \tilde{B}, \theta) = \frac{e^{o_r}}{\sum_{i=1}^{n_r} e^{o_i}}, \qquad o = M\tilde{B} + b$$
where $o$ is the score associated with all relation types, $n_r$ is the total number of relations, $M$ is a projection matrix, and $b$ is the bias term.
We define the objective of the bag encoder using the cross-entropy function as follows:
$$L_B(\theta) = -\sum_{i} \log p(r_i \mid \tilde{B}_i, \theta)$$
where $r_i$ is the relation label of the $i$-th bag.
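To make the bag encoder concrete, the following NumPy sketch computes selective attention over a bag, the aggregated bag representation, and the cross-entropy loss for one bag. It is an illustration under assumptions rather than the authors' code: the helper names, the random toy inputs, and the choice to store the diagonal matrix $A$ as a vector `a_diag` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def encode_bag(H, a_diag, q_r):
    """H: (K, d) sentence representations; a_diag: (d,) diagonal of A; q_r: (d,) relation query."""
    scores = (H * a_diag) @ q_r      # f_i = h_i A q_r, with A diagonal
    alpha = softmax(scores)          # normalized attention weights over the bag
    return alpha @ H, alpha          # weighted sum of instances, and the weights

def bag_loss(bag_rep, M, b, gold):
    """Cross-entropy of softmax(M @ bag_rep + b) against the gold relation index."""
    probs = softmax(M @ bag_rep + b)
    return -np.log(probs[gold]), probs

# Toy sizes: K instances, hidden size d, n_r relation types (all made up).
rng = np.random.default_rng(0)
K, d, n_r = 4, 6, 3
H = rng.normal(size=(K, d))
a_diag = np.ones(d)
q_r = rng.normal(size=d)
M = rng.normal(size=(n_r, d))
b = np.zeros(n_r)

bag_rep, alpha = encode_bag(H, a_diag, q_r)
loss, probs = bag_loss(bag_rep, M, b, gold=1)
print(alpha, loss)
```

Note that the gradient of this loss reaches each instance only through the attention-weighted sum, which is the inefficiency the paper targets.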

Contrastive Instance Learning
As illustrated in section 1, the goal of our framework CIL is that instances containing the same relational triple (i.e. positive pairs) should be as close (denoted ∼) as possible in the hidden semantic space, and instances containing different relational triples (i.e. negative pairs) should be as far away (denoted ≁) as possible. A formal description is as follows.
Assume a batch of bags (with batch size $G$): $(B_1, B_2, \ldots, B_G)$, where the relational triples of all bags differ from each other. Each bag $B$ in the batch is constructed from a certain relational triple $[e_1, e_2, r]$, and all instances $x$ inside the bag satisfy this triple. The representation of the triple can be obtained by the bag encoder as $\tilde{B}$.
We pick any two bags $B_s$ and $B_t$ ($t \neq s$) in the batch to further illustrate the process of contrastive instance learning. $B_s$ is defined as the source bag constructed with relational triple $[e_{s1}, e_{s2}, r_s]$, while $B_t$ is the target bag constructed with triple $[e_{t1}, e_{t2}, r_t]$.
We then discuss the positive pair instance and the negative pair instances for any instance $x_s$ in bag $B_s$.
It is worth noting that all bags are constructed automatically by the distantly supervised method, which extracts relational triples from instances in a heuristic manner and may introduce true/false positive label noise to the generated data. In other words, though the instance $x$ is included in the bag with relational triple $[e_1, e_2, r]$, it may be noisy and fail to express the relation $r$.

Positive Pair Construction
Instance $x_s$ ∼ Random Instance $x'_s$ One intuitive choice of positive pair instance for $x_s$ is simply to pick another instance $x'_s \neq x_s$ from the bag $B_s$ at random. However, both instances $x_s$ and $x'_s$ may suffer from data noise, and they are unlikely to both express the same relational triple. Thus, taking instance $x_s$ and a randomly selected instance $x'_s$ as a positive pair is not an optimal option.
Instance $x_s$ ∼ Relational Triple $\tilde{B}_s$ Another positive pair instance candidate for instance $x_s$ is the relational triple representation $\tilde{B}_s$ of the current bag $B_s$. Though $\tilde{B}_s$ can be regarded as a de-noised representation, $x_s$ may still be noisy and express another relation $r \neq r_s$. Besides, the quality of the constructed positive pairs relies heavily on the performance of the bag encoder.
Instance $x_s$ ∼ Augmented Instance $x^*_s$ From the above analysis, we can see that the general positive pair construction methods often encounter the challenge of DS noise. Here, we propose a noise-free positive pair construction method based on TF-IDF data augmentation: if we only make small and controllable augmentations to the original instance $x_s$, the augmented instance $x^*_s$ should satisfy the same relational triple as $x_s$. In detail: (1) We first view each instance as a document and each word in the instances as a term, and train a TF-IDF model on the whole training corpus. (2) Based on the trained TF-IDF model, we insert unimportant words (those with low TF-IDF scores, see Figure 6) into instance $x_s$, or substitute them in it, at a specific ratio to obtain its augmented instance $x^*_s$. In particular, special masks are added to entity words to prevent them from being substituted.
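The two augmentation steps above can be sketched in plain Python. This is a simplified illustration, not the procedure used in the paper: it implements substitution only, and the functions `idf_table` and `augment`, the replacement pool of five low-IDF words, the ratio, and the toy corpus are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def idf_table(corpus):
    """Treat each instance as a document; IDF(w) = log(N / df(w))."""
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n_docs / c) for w, c in df.items()}

def augment(tokens, entities, idf, ratio=0.2, seed=0):
    """Substitute the lowest-scoring non-entity tokens with other low-IDF words."""
    rng = random.Random(seed)
    tf = Counter(tokens)
    score = {w: tf[w] / len(tokens) * idf.get(w, 0.0) for w in tf}
    # Entity words are masked out, so they can never be substituted.
    candidates = [i for i, w in enumerate(tokens) if w not in entities]
    candidates.sort(key=lambda i: score[tokens[i]])
    n_edit = max(1, int(ratio * len(tokens)))
    # Hypothetical replacement pool: the least informative words in the vocabulary.
    pool = sorted((w for w in idf if w not in entities), key=idf.get)[:5]
    out = list(tokens)
    for i in candidates[:n_edit]:
        out[i] = rng.choice(pool)
    return out

corpus = [
    "john was born in dublin".split(),
    "mary was born in paris".split(),
    "the city of dublin is in ireland".split(),
]
x_s = corpus[0]
x_aug = augment(x_s, entities={"john", "dublin"}, idf=idf_table(corpus), ratio=0.4)
print(x_aug)
```

Because only low-scoring, non-entity positions are edited, the augmented instance keeps both entity mentions and most of the relation-bearing context.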

Negative Pair Construction
Instance $x_s$ ≁ Random Instance $x_t$ Similarly, for instance $x_s$ in bag $B_s$, we can randomly select an instance $x_t$ from a different bag $B_t$ as its negative pair instance. Under this strategy, $x_s$ is pushed away from the average representation $\sum_{i=1}^{K} \alpha_i h_i$ of the bag $B_t$, where all $\alpha_i = \frac{1}{K}$ approximately. Moreover, the randomly selected instance $x_t$ may be too noisy to represent the relational triple of bag $B_t$, which may hurt model performance.
Figure 7: Instance $x_s$ ≁ Random Instance $x_t$
Instance $x_s$ ≁ Relational Triple $\tilde{B}_t$ Compared to the random selection strategy, using the relational triple representation $\tilde{B}_t$ as the negative pair instance for $x_s$ is a better choice for reducing the impact of data noise, as the instance $x_s$ can be seen as being pushed away from a weighted representation $\sum_{i=1}^{K} \alpha_i h_i$ in which all $\alpha_i$ are learnable. Though the instance $x_s$ may still be noisy, $x_s$ and $\tilde{B}_t$ cannot belong to the same relational triple.

Training Objective
As discussed above, for any instance $x_s$ in the source bag $B_s$: (1) the instance $x^*_s$ obtained from $x_s$ by controllable data augmentation is its positive pair instance; (2) the relational triple representations $\tilde{B}_t$ of the other bags ($t \neq s$) in the batch are its negative pair instances. The overall schematic diagram of CIL is shown in Figure 9. We define the contrastive objective for instance $x_s$ as:
$$L_C(x_s; \theta) = -\log \frac{e^{\mathrm{sim}(h_s, h^*_s)}}{e^{\mathrm{sim}(h_s, h^*_s)} + \sum_{t: t \neq s} e^{\mathrm{sim}(h_s, \tilde{B}_t)}}$$
where $\mathrm{sim}(a, b)$ is the function that measures the similarity between two representation vectors $a$, $b$, and $h_s = H(x_s)$, $h^*_s = H(x^*_s)$ are the sentence representations of instances $x_s$, $x^*_s$.
Besides, to inherit the language understanding ability of BERT and avoid catastrophic forgetting (McCloskey and Cohen, 1989), we also add the masked language modeling (MLM) objective to our framework. The pre-text task MLM randomly masks some tokens in the inputs and asks the model to predict the masked tokens, which prompts the model to capture rich semantic information in the contexts. We denote this objective as $L_M(\theta)$.
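The contrastive objective described above, with one augmented positive and the other bags' representations as negatives, can be sketched as an InfoNCE-style loss. The sketch is illustrative only: the text does not fix the similarity function, so the cosine similarity used here is an assumption, and the toy vectors are made up.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; an assumed choice for sim(a, b)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(h_s, h_pos, neg_bags, sim=cosine):
    """-log( e^{sim(h_s,h_pos)} / (e^{sim(h_s,h_pos)} + sum_t e^{sim(h_s,B_t)}) )."""
    pos = np.exp(sim(h_s, h_pos))
    neg = sum(np.exp(sim(h_s, b_t)) for b_t in neg_bags)
    return float(-np.log(pos / (pos + neg)))

h_s = np.array([1.0, 0.0])        # representation of the source instance
h_pos = np.array([0.9, 0.1])      # representation of its augmented copy

far = [np.array([-1.0, 0.0])]     # a negative bag far from h_s
near = [np.array([1.0, 0.05])]    # a negative bag close to h_s

loss_far = contrastive_loss(h_s, h_pos, far)
loss_near = contrastive_loss(h_s, h_pos, near)
print(loss_far, loss_near)
```

The loss is smaller when the negative bag representations lie far from the instance, which is the behaviour the objective is meant to encourage.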
Accordingly, the total training objective of our contrastive instance learning framework is:
$$L(\theta) = \frac{\lambda(t)}{N} \sum_{x_s} L_C(x_s; \theta) + L_B(\theta) + \lambda_M L_M(\theta)$$
where $N = KG$ is the total number of instances in the batch, $\lambda_M$ is the weight of the language model objective $L_M$, and $\lambda(t) \in [0, 1]$ is an increasing function of the relative training step $t$. At the beginning of training, the value of $\lambda(t)$ is relatively small, and our framework CIL focuses on obtaining an accurate bag encoder ($L_B$). The value of $\lambda(t)$ gradually increases to 1 as the relative training step $t$ increases, and more attention is paid to the contrastive instance learning ($L_C$).
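The weighting scheme can be illustrated with a short sketch. The exact form of $\lambda(t)$ is not recoverable from the text here, so the sigmoid-shaped schedule `lam`, its `sharpness` parameter, the default `lambda_m`, and the way the three terms are combined in `total_loss` are assumptions chosen only to match the description (small at the start, approaching 1 later).

```python
import math

def lam(t, sharpness=10.0):
    """Hypothetical increasing schedule on t in [0, 1]: 0 at t = 0, close to 1 at t = 1."""
    return 2.0 / (1.0 + math.exp(-sharpness * t)) - 1.0

def total_loss(contrastive_losses, bag_loss, mlm_loss, t, lambda_m=0.5):
    """lam(t) * mean of L_C over the N instances + L_B + lambda_m * L_M."""
    n = len(contrastive_losses)
    return lam(t) * sum(contrastive_losses) / n + bag_loss + lambda_m * mlm_loss

# Early in training the contrastive term is switched off; later it dominates more.
early = total_loss([1.0, 3.0], bag_loss=2.0, mlm_loss=4.0, t=0.0)
late = total_loss([1.0, 3.0], bag_loss=2.0, mlm_loss=4.0, t=1.0)
print(early, late)
```

With this schedule the bag encoder is trained almost alone at first, so the negatives $\tilde{B}_t$ are reasonably accurate by the time the contrastive term carries weight.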

Experiments
Our experiments are designed to verify the effectiveness of our proposed framework CIL.

Benchmarks
We evaluate our method on three popular DSRE benchmarks: NYT10, GDS and KBP. The dataset statistics are listed in Table 1. NYT10 (Riedel et al., 2010) aligns Freebase entity relations with the New York Times corpus, and it has two test set versions: (1) NYT10-D employs held-out KB facts as the test set and is still distantly supervised. (2) NYT10-H is constructed manually by Hoffmann et al. (2011) and contains 395 sentences with human annotations. GDS (Jat et al., 2018) is created by extending the Google RE corpus with additional instances for each entity pair, and this dataset ensures that the at-least-one assumption of MIL always holds.
KBP (Ling and Weld, 2012) uses Wikipedia articles annotated with Freebase entries as the training set, and employs manually-annotated sentences from 2013 KBP slot filling assessment results (Ellis et al., 2012) as the extra test set.

Evaluation Metrics
Following previous literature (Lin et al., 2016;Vashishth et al., 2018;Alt et al., 2019), we first conduct a held-out evaluation to measure model performances approximately on NYT10-D and GDS. Besides, we also conduct an evaluation on two human-annotated datasets (NYT10-H & KBP) to further support our claims. Specifically, Precision-Recall curves (PR-curve) are drawn to show the trade-off between model precision and recall, the Area Under Curve (AUC) metric is used to evaluate overall model performances, and the Precision at N (P@N) metric is also reported to consider the accuracy value for different cut-offs.
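For reference, the ranking metrics named above can be computed as in the sketch below. It is a generic illustration rather than the evaluation script used in the paper; the function names, the trapezoidal rule for the area, and the toy scores and labels are assumptions.

```python
def precision_at_n(scores, labels, n):
    """Fraction of correct predictions among the n highest-scoring ones."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return sum(label for _, label in ranked[:n]) / n

def pr_curve(scores, labels):
    """(recall, precision) after each prediction, in descending score order."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total_pos = sum(labels)
    points, hits = [], 0
    for k, (_, label) in enumerate(ranked, start=1):
        hits += label
        points.append((hits / total_pos, hits / k))
    return points

def auc(points):
    """Trapezoidal area under the (recall, precision) points."""
    area, prev_r, prev_p = 0.0, 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return area

# Toy predictions: confidence scores and 0/1 correctness labels (made up).
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 1]
p_at_2 = precision_at_n(scores, labels, 2)
area = auc(pr_curve(scores, labels))
print(p_at_2, area)
```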

Baseline Models
We choose six recent methods as baseline models.

Evaluation on Distantly Supervised Set
We summarize the performance of our method and the above-mentioned baseline models in Table 2. From the results, we can observe that on both datasets, our proposed framework CIL achieves the best performance on all metrics. The overall PR-curve on NYT10-D is visualized in Figure 10. From the curve, we can observe that: (1) Compared to the PR-curves of the baseline models, our method shifts the curve up considerably. (2) The previous SOTA model DISTRE performs worse than RESIDE at the beginning of the curve and yields better performance after a recall level of approximately 0.25, whereas our method CIL surpasses both previous SOTA models over the whole curve and is more balanced between precision and recall. (3) Furthermore, although MTB is a SOTA scheme for relation learning, it fails to achieve competitive results for DSRE. This is because MTB relies on label information for pre-training, and the noisy labels in DSRE may degrade its performance.

Evaluation on Manually Annotated Set
The automated held-out evaluation may not reflect the actual performance of DSRE models, as it gives false positive/negative labels and incomplete KB information. Thus, to further support our claims, we also evaluate our method on two human-annotated datasets, and the results are listed in Table 3. From this table, we can see that: (1) Our proposed framework CIL still performs well under accurate human evaluation, with an average AUC improvement of 21.7% on NYT10-H and 36.2% on KBP, which means our method generalizes well to real scenarios. (2) On NYT10-H, DISTRE fails to surpass PCNN-ATT in the P@Mean metric. This indicates that DISTRE gives a high recall but a low precision, whereas our method CIL boosts model precision (54.1→63.0) while also improving model recall (37.8→46.0). The human evaluation results further confirm the observations from the held-out evaluation described above. We also present the PR-curve on KBP in Figure 11. Under accurate sentence-level evaluation on KBP, the advantage of our model is more obvious, with average improvements of 36.2% on AUC, 17.3% on F1 and 3.9% on P@Mean.

Ablation Study
To further understand our proposed framework CIL, we also conduct ablation studies.
We first conduct an ablation experiment to verify that CIL utilizes the abundant instances inside bags: (1) Removing our proposed contrastive instance learning makes the framework degenerate into the vanilla MIL framework, and we train the MIL on regular bags (MIL$_{bag}$). (2) To show that the MIL cannot make full use of sentences, we also train the MIL on sentence bags (MIL$_{sent}$), in which each sentence in the training corpus is repeated to form a bag. From Table 4 we can see that: (1) MIL$_{bag}$ only resorts to the accurate bag-level representations to train the model and fails to exploit each instance inside bags; thus, it performs worse than our method CIL (50.8→40.3). (2) Though MIL$_{sent}$ can access all training sentences, it loses the noise-reduction advantage of MIL$_{bag}$ (40.3→30.6). The noisy label supervision may misguide model training, and its performance suffers heavily from DS data noise (86.0→63.3). (3) Our framework CIL succeeds in leveraging abundant instances while retaining the ability to denoise.

To validate the rationality of our proposed positive/negative pair construction strategy, we also conduct an ablation study on three variants of our framework CIL. We denote these variants as:
• CIL$_{randpos}$: randomly select an instance $x'_s$ from the same bag $B_s$ as the positive pair instance for $x_s$.
• CIL$_{bagpos}$: take the relational triple representation $\tilde{B}_s$ as the positive pair instance for $x_s$.
• CIL$_{randneg}$: randomly select an instance $x_t$ from another bag $B_t$ as the negative pair instance for $x_s$.
We summarize the performance of CIL and the three variants in Table 5.
As analyzed in section 2.4, the three variants of our CIL framework may suffer from DS noise: (1) Both CIL$_{randpos}$ and CIL$_{bagpos}$ may construct noisy positive pairs; therefore, their performance drops slightly (50.8→49.2, 50.8→47.8). Besides, the variant CIL$_{bagpos}$ also relies on the bag encoder, which is why it performs worse than CIL$_{randpos}$ (49.2→47.8).
(2) Though the constructed negative pairs need not be as accurate as positive pairs, the variant CIL$_{randneg}$ treats all instances equally, which gives up the advantage of the accurate bag-level representations. Thus, its performance also declines (50.8→48.4).

Case Study
We select a typical bag (see Table 6) from the training set to better illustrate the difference between MIL$_{sent}$, MIL$_{bag}$ and our framework CIL.

Sentence: john mcgahern, the eldest of seven children, was born on nov.12, 1934, in dublin.
Predicted Relation: S: /place borned; B: /place borned; C: /place deaded

Table 6: A typical bag selected from the training set. The bag is constructed with relational triple (john mcgahern, /place borned, dublin); the first sentence (S1) is clean and expresses relation /place borned, while the second instance (S2) is noisy, with true relation /place deaded. S: MIL$_{sent}$, B: MIL$_{bag}$ and C: CIL.

Under the MIL$_{sent}$ pattern, both S1 and S2 are used for model training, and the noisy sentence S2 may confuse the model. Under the MIL$_{bag}$ pattern, S1 is assigned a high attention score while S2 receives a low one. However, MIL$_{bag}$ only relies on bag-level representations, and sentences like S2 cannot be used efficiently. Our framework CIL makes full use of all instances (S1, S2) and avoids the negative effect of DS data noise from S2.

Related Work
Our work is related to DSRE, pre-trained language models, and recent contrastive learning methods.
DSRE Traditional supervised RE systems rely heavily on large-scale human-annotated datasets, which are expensive and time-consuming to build. Distant supervision was then introduced to the RE field; it aligns a training corpus with KB facts to generate data automatically. However, such a heuristic process results in data noise and makes classical supervised RE models hard to train. To solve this issue, Lin et al. (2016) applied the multi-instance learning framework with a selective attention mechanism over all instances, which helps RE models learn under DS data noise. Following the MIL framework, recent works improve DSRE models from many different aspects: (1) (Defferrard et al., 2016) to encode syntactic information from the text and improves DSRE models with additional side information from KBs. (4) Alt et al. (2019) extended GPT to DSRE and fine-tuned it to achieve SOTA model performance.
Pre-trained LM Recently, pre-trained language models have achieved great success in the NLP field. Vaswani et al. (2017) proposed a self-attention based architecture, the Transformer, which soon became the backbone of many subsequent LMs. By pre-training on a large-scale corpus, BERT (Devlin et al., 2019) obtains the ability to capture a notable amount of "common-sense" knowledge and gains significant improvements on many tasks under the fine-tuning scheme. At the same time, GPT (Radford et al., 2018), XL-Net (Yang et al., 2019) and GPT2 (Radford et al., 2019) are also well-known pre-trained representatives with excellent transfer learning ability. Moreover, some works (Radford et al., 2019) found that considerably increasing the size of the LM results in even better generalization to downstream tasks.
Contrastive Learning As a popular unsupervised method, contrastive learning aims to learn representations by contrasting positive pairs against negative pairs (Hadsell et al., 2006; Oord et al., 2018; Chen et al., 2020; He et al., 2020). Wu et al. (2018) proposed to use non-parametric instance-level discrimination to leverage more information in the data samples. Our approach, however, achieves the goal of data efficiency in a more complicated MIL setting: instead of contrasting only instance-level information during training, we find that the instance-bag negative pair is the most effective method, which constitutes one of our main contributions. In the NLP field, Dai and Lin (2017) proposed to use contrastive learning for image captioning, and Clark et al. (2020) trained a discriminative model for language representation learning. Recent literature (Peng et al., 2020) has also attempted to relate the contrastive pre-training scheme to the classical supervised RE task. Different from our work, Peng et al. (2020) aim to utilize abundant DS data to help classical supervised RE models learn a better relation representation, while our CIL focuses on learning an effective and efficient DSRE model under DS data noise.

Conclusion
In this work, we discuss the long-standing DSRE framework (i.e. MIL) and argue that MIL is not efficient enough, as it aims to form accurate bag-level representations but sacrifices the potential information of the abundant instances inside MIL bags. Thus, we propose a contrastive instance learning method, CIL, to boost the performance of MIL models. Experiments have shown the effectiveness of our CIL, with stable and significant improvements over several baseline models, including current SOTA systems.