Adjacency List Oriented Relational Fact Extraction via Adaptive Multi-task Learning

Relational fact extraction aims to extract semantic triplets from unstructured text. In this work, we show that relational fact extraction models can be organized according to a graph-oriented analytical perspective. An efficient model, aDjacency lIst oRiented rElational faCT (DIRECT), is proposed based on this analytical framework. To alleviate the challenges of error propagation and sub-task loss equilibrium, DIRECT employs a novel adaptive multi-task learning strategy with dynamic sub-task loss balancing. Extensive experiments are conducted on two benchmark datasets, and the results show that the proposed model outperforms a series of state-of-the-art (SoTA) models for relational triplet extraction.


Introduction
Relational fact extraction, as an essential NLP task, is playing an increasingly important role in knowledge graph construction (Han et al., 2019; Distiawan et al., 2019). It aims to extract relational triplets from text. A relational triplet takes the form (subject, relation, object), or (s, r, o) (Zeng et al., 2019). While various prior models have been proposed for relational fact extraction, few of them analyze this task from the perspective of output data structure.
As shown in Figure 1, relational fact extraction can be characterized as a directed graph construction task, where the flexibility and heterogeneity of graph representations bring additional benefits. In practice, there are three common ways to represent graphs (Gross and Yellen, 2005): Edge List is utilized to predict a sequence of triplets (edges). The recent sequence-to-sequence based models, such as NovelTagging (Zheng et al., 2017), CopyRE (Zeng et al., 2018), and CopyRL (Zeng et al., 2019), belong to this category. An edge list is a simple and space-efficient way to represent a graph (Arifuzzaman and Khan, 2015). However, there are three problems. First, the triplet overlapping problem (Zeng et al., 2018). For instance, as shown in Figure 1, for the triplets (Obama, nationality, USA) and (Obama, president of, USA), there are two types of relations between "Obama" and "USA". If the model only generates one sequence from the text (Zheng et al., 2017), it may fail to identify multiple relations between entities. Second, to overcome the triplet overlapping problem, the model may have to extract triplet elements repeatedly (Zeng et al., 2018), which increases the extraction cost. Third, there could be an ordering problem (Zeng et al., 2019): for multiple triplets, the extraction order can influence model performance.
* These two authors contributed equally to this research. † Zhuoren Jiang is the corresponding author.
Adjacency Matrices are used to predict matrices that represent exactly which entities (vertices) have semantic relations (edges) between them. Most early works, which take a pipeline approach (Zelenko et al., 2003; Zhou et al., 2005), belong to this category. These models first recognize all entities in text and then perform relation classification for each entity pair. The subsequent neural network-based models (Bekoulis et al., 2018; Dai et al., 2019), which attempt to extract entities and relations jointly, can also be classified into this category.
Compared to the edge list, adjacency matrices have better relation (edge) searching efficiency (Arifuzzaman and Khan, 2015). Furthermore, adjacency matrices oriented models are able to cover different overlapping cases (Zeng et al., 2018) for the relational fact extraction task. But the space cost of this approach can be expensive: in most cases, the output matrices are very sparse. For instance, for a sentence with n tokens and m kinds of relations, the output space is n · n · m, which is costly in terms of graph representation efficiency. This phenomenon is also illustrated in Figure 1.
Adjacency List is designed to predict an array of linked lists that serves as a representation of a graph. As depicted in Figure 1, in an adjacency list, each vertex v (key) points to a list (value) containing all other vertices connected to v by one or more edges. The adjacency list is a hybrid graph representation between the edge list and adjacency matrices (Gross and Yellen, 2005), which can balance space and searching efficiency [1]. Due to the structural characteristics of the adjacency list, this type of model usually adopts a cascade fashion to identify subject, object, and relation sequentially. For instance, the recent state-of-the-art model CasRel (Wei et al., 2020) can be considered an exemplar. It utilizes a two-step framework to recognize the possible object(s) of a given subject under a specific relation. However, CasRel is not fully adjacency list oriented: in the first step, it uses the subject as the key; in the second step, it predicts (relation, object) pairs using an adjacency matrix representation.
Despite its considerable potential, the cascade fashion of adjacency list oriented models may cause sub-task error propagation problems (Shen et al., 2019), i.e., errors from ancestor sub-tasks may accumulate and harm downstream ones, and sub-tasks can hardly share supervision signals. Multi-task learning (Caruana, 1997) can alleviate this problem; however, the sub-task loss balancing problem (Chen et al., 2018; Sener and Koltun, 2018) could compromise its performance.

[1] More detailed complexity analyses of different graph representations are provided in Appendix Section 6.3.
Based on the analysis from the perspective of output data structure, we propose a novel solution, the aDjacency lIst oRiented rElational faCT extraction model (DIRECT), with the following advantages:
• For efficiency, DIRECT is a fully adjacency list oriented model, which consists of a shared BERT encoder, Pointer-Network based subject and object extractors, and a relation classification module. In Section 3.4, we provide a detailed comparative analysis to demonstrate the efficiency of the proposed method.
• From the performance viewpoint, to address the sub-task error propagation and sub-task loss balancing problems, DIRECT employs a novel adaptive multi-task learning strategy with a dynamic sub-task loss balancing approach. In Sections 3.2 and 3.3, empirical results demonstrate that DIRECT achieves state-of-the-art performance on the relational fact extraction task, and that the adaptive multi-task learning strategy plays a positive role in improving task performance.
The major contributions of this paper can be summarized as follows:
1. We reformulate the relational fact extraction problem by leveraging an analytical framework of graph-oriented output structures. To the best of our knowledge, this is a pioneering investigation into the output data structure of relational fact extraction.
2. We propose a novel solution, DIRECT, which is a fully adjacency list oriented model with a novel adaptive multi-task learning strategy.
3. Through extensive experiments on two benchmark datasets, we demonstrate the efficiency and efficacy of DIRECT. The proposed DIRECT outperforms the state-of-the-art baseline models.

The DIRECT Framework
In this section, we introduce the framework of the proposed DIRECT model, which includes a shared BERT encoder and three output layers: subject extraction, object extraction, and relation classification. As shown in Figure 2, DIRECT is fully adjacency list oriented. The input sentence is first fed into the subject extraction module to extract all subjects. Each extracted subject is then concatenated with the sentence and fed into the object extraction module to extract all objects, which forms a set of subject-object pairs. Finally, each subject-object pair is concatenated with the sentence and fed into the relation classification module to obtain the relations between them. To balance the weights of sub-task losses and improve the global task performance, the three modules share the BERT encoder layer and are trained with an adaptive multi-task learning strategy.
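The cascade described above can be sketched as follows. This is a minimal illustrative sketch: the rule-based stand-in functions replace the actual BERT-based modules, and all names and returned values are assumptions for the running "Obama" example from Section 1.

```python
def extract_subjects(sentence):
    # A real system runs the shared BERT encoder plus the subject
    # pointer network here; this stand-in returns a fixed answer.
    return ["Obama"]

def extract_objects(sentence, subject):
    # In DIRECT the input is the subject concatenated with the
    # sentence; this stand-in ignores the encoding details.
    return ["USA"]

def classify_relations(sentence, subject, obj):
    # Multi-label classification: one subject-object pair may hold
    # several relations at once (the overlapping case).
    return ["nationality", "president_of"]

def extract_triplets(sentence):
    """Cascade: subjects -> objects per subject -> relations per pair."""
    triplets = []
    for s in extract_subjects(sentence):
        for o in extract_objects(sentence, s):
            for r in classify_relations(sentence, s, o):
                triplets.append((s, r, o))
    return triplets

print(extract_triplets("Obama was the president of the USA."))
```

Note how the EntityPairOverlap case falls out naturally: the relation classifier emits one label per relation, so the pair ("Obama", "USA") yields two triplets without re-extracting either entity.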

Shared BERT Encoder
In the DIRECT framework, the encoder is used to extract the semantic features from the inputs for three modules. As aforementioned, we employ the BERT (Devlin et al., 2019) as the shared encoder to make use of its pre-trained knowledge and attention mechanism.
The architecture of the shared encoder is shown in Figure 2. The lower embedding layer and transformers (Vaswani et al., 2017) are shared across all three modules, while the top layers produce the task-specific outputs.
The encoding process is as follows:

h^t = BERT(x^t)

where x^t = [w_1, ..., w_n] is the input text of task t and h^t is the hidden vector sequence of the input. Due to limited space, for the detailed architecture of BERT please refer to the original paper (Devlin et al., 2019).

Subject and Object Extraction
The subject and object extraction modules are motivated by the Pointer-Network (Vinyals et al., 2015) architecture, which is widely used in the Machine Reading Comprehension (MRC) (Rajpurkar et al., 2016) task. Different from the MRC task, which only needs to extract a single span, subject and object extraction need to extract multiple spans. Therefore, in the training phase, we replace the softmax function with the sigmoid function as the activation function of the output layer, and replace cross entropy (CE) (Goodfellow et al., 2016) with binary cross entropy (BCE) (Luc et al., 2016) as the loss function. Specifically, we perform two independent binary classifications for each token, indicating whether the current token is the start or the end of a span. The probability of a token being a start or an end is:

p^t_{i,start} = σ(W^t_{start} h_i + b^t_{start}),  p^t_{i,end} = σ(W^t_{end} h_i + b^t_{end})

where h_i represents the hidden vector of the i-th token, t ∈ [s, o] denotes subject and object extraction respectively, W^t ∈ R^{h×1} is the trainable weight, b^t ∈ R^1 is the bias, and σ is the sigmoid function. During inference, we first recognize all start positions by checking whether the probability p^t_{i,start} > α, where α is the extraction threshold. Then, we identify the corresponding end position as the one with the largest probability p^t_{i,end} between two neighboring start positions. Concretely, assuming pos_{j,start} is the start position of the j-th span, the corresponding end position is:

pos_{j,end} = argmax_{pos_{j,start} ≤ i < pos_{j+1,start}} p^t_{i,end}

Though the overall structures are similar, the inputs for subject and object extraction are different. When extracting subjects, only the original sentence needs to be input:

x^s = [w_1, ..., w_n]

where w_i represents the i-th token of the original sentence.
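The inference-time decoding rule above can be sketched as a small standalone function. This is an illustrative reading of the paper's rule, working on plain probability lists rather than model outputs; the function name and threshold default are assumptions.

```python
def decode_spans(p_start, p_end, alpha=0.5):
    """Decode multiple (start, end) spans from per-token probabilities.

    Starts are all tokens with p_start > alpha; each span's end is the
    position with the largest p_end between that start and the next
    start (or the end of the sentence for the last span).
    """
    starts = [i for i, p in enumerate(p_start) if p > alpha]
    spans = []
    for j, s in enumerate(starts):
        # Search window: from this start up to (not including) the next one.
        limit = starts[j + 1] if j + 1 < len(starts) else len(p_end)
        end = max(range(s, limit), key=lambda i: p_end[i])
        spans.append((s, end))
    return spans
```

For example, start probabilities [0.9, 0.1, 0.8, 0.1] and end probabilities [0.1, 0.9, 0.2, 0.7] decode to two spans, tokens 0-1 and tokens 2-3, since the end search for the first span stops before the second start.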
Meanwhile, object extraction is based on the corresponding subject. To form the input, the subject s and the original sentence x are concatenated with [sep] as follows:

x^o = [s; [sep]; x]

Relation classification
The output layer of relation classification is relatively simple: a standard multi-label classification model. The [cls] vector obtained by the BERT encoder is used as the sentence embedding. A fully connected layer performs the nonlinear transformation, and multi-label classification predicts the relations of the input subject-object pair:

P^r = σ(h_[cls] W^r + b^r)

where P^r ∈ R^c is the predicted probability vector of relations, σ is the sigmoid function, W^r ∈ R^{h×c} and b^r ∈ R^c are the trainable weights and bias, h is the hidden size of the encoder, c is the number of relations, and h_[cls] denotes the hidden vector of the first token [cls]. The input for the relation classification task is as follows:

x^r = [s; [sep]; o; [sep]; x]
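The classification head above is a single affine map followed by an element-wise sigmoid. A minimal NumPy sketch, with hypothetical toy dimensions (the real h_cls comes from the BERT encoder, and the 0.5 threshold is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_relations(h_cls, W_r, b_r, threshold=0.5):
    # P_r = sigmoid(h_cls @ W_r + b_r): one independent probability per
    # relation type, so a subject-object pair can carry several relations.
    p_r = sigmoid(h_cls @ W_r + b_r)
    return [i for i, p in enumerate(p_r) if p > threshold]

# Toy example: hidden size h = 2, c = 2 relation types.
h_cls = np.array([1.0, -1.0])
W_r = np.array([[2.0, -2.0],
                [0.0,  0.0]])
b_r = np.zeros(2)
print(classify_relations(h_cls, W_r, b_r))
```

Because each relation gets its own sigmoid rather than competing in a softmax, thresholding is done per dimension; this is what makes the layer multi-label.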

Adaptive Multi-task Learning
In DIRECT, the subject extraction module, object extraction module, and relation classification module can be considered as three sub-tasks. As aforementioned, if we train each module directly and separately, the error propagation problem would reduce the task performance. Meanwhile, three independent encoders would consume more memory. Therefore, we use multi-task learning to alleviate this problem, and the encoder layer is shared across the three modules.

Algorithm 1: Adaptive Multi-task Learning with Dynamic Loss Balancing
  Initialize model parameters Θ randomly;
  Load pre-trained BERT parameters for the shared encoder;
  Prepare the data for each task t and pack them into mini-batches: D_t, t ∈ [s, o, r];
  Get the number of batches for each task: n_t;
  Set the number of epochs for training;
  Initialize the EMA for each task: v_t = 1, with decay = 0.99;
However, applying multi-task learning in DIRECT is challenging, due to the following problems:
• The inputs and outputs of the three modules are different, which means we cannot simply sum up the loss of each task.
• How should we balance the weights of the losses for the three sub-task modules?
These issues can affect the final results of multi-task training (Shen et al., 2019; Sener and Koltun, 2018).
In this work, based on the architecture of MT-DNN (Liu et al., 2019b), we propose a novel adaptive multi-task learning strategy to address the above problems, shown as Algorithm 1. Basically, the datasets are first split into mini-batches. A batch is then randomly sampled to calculate the loss, and the parameters of the shared encoder and the corresponding task-specific layer are updated accordingly. Notably, the learning effect of each task t is different and changes dynamically during training. Therefore, an approach that adaptively adjusts the weights of the task losses is applied. The sum of a sub-task's losses, l_t, is utilized to approximate its optimization effect. The adaptive weight adjusting strategy ensures that the more room a sub-task has to be optimized, the more weight its loss will receive. Furthermore, an exponential moving average (EMA) (Lawrance and Lewis, 1977) is maintained to avoid drastic fluctuations of the loss weights. Last but not least, to make sure that each task has enough influence on the shared encoder, the weight of each sub-task is penalized according to the amount of training data of that sub-task.
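The weighting idea can be sketched in a few lines. This is a simplified reading of the strategy: it keeps an EMA of each task's loss and assigns weight proportional to it, so tasks with more room to improve get more weight. The class name is an assumption, and the training-data-amount penalty mentioned above is omitted for brevity.

```python
class AdaptiveLossWeights:
    """EMA-based dynamic loss balancing across sub-tasks (sketch)."""

    def __init__(self, tasks, decay=0.99):
        self.decay = decay
        # The paper initializes each task's EMA v_t to 1.
        self.ema = {t: 1.0 for t in tasks}

    def update(self, task, loss):
        # Smooth the observed loss so weights do not fluctuate drastically.
        self.ema[task] = self.decay * self.ema[task] + (1 - self.decay) * loss

    def weight(self, task):
        # Normalize so that weights across tasks sum to 1; a task whose
        # smoothed loss is larger (more room to optimize) gets more weight.
        total = sum(self.ema.values())
        return self.ema[task] / total
```

In a training loop one would call `update(t, l_t)` after each sampled batch of task t and scale that batch's loss by `weight(t)` before backpropagation.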

Dataset and Experiment Setting
Datasets. Two public datasets are used for evaluation. NYT (Riedel et al., 2010) was originally produced by a distant supervision approach; it contains 1.18M sentences with 24 predefined relation types. WebNLG (Gardent et al., 2017) was originally created for Natural Language Generation (NLG) tasks; Zeng et al. (2018) adapted it for the relational triplet extraction task, and it contains 246 predefined relation types. There are different versions of these two datasets. To facilitate comparison, we use the datasets released by Zeng et al. (2018) and follow their data split rules.
Besides basic relational triplet extraction, recent studies focus on the relational triplet overlapping problem (Zeng et al., 2018; Wei et al., 2020). Following the overlapping pattern definition of relational triplets (Zeng et al., 2018), the sentences in both datasets are divided into three categories, namely Normal, EntityPairOverlap (EPO), and SingleEntityOverlap (SEO). The statistics of the two datasets are described in Table 1.
Baselines: the following strong state-of-the-art (SoTA) models have been compared in the experiments.
• NovelTagging (Zheng et al., 2017) introduces a tagging scheme that transforms the joint entity and relation extraction task into a sequence labeling problem. It can be considered as edge list oriented.
• CopyRE (Zeng et al., 2018) is a seq2seq based model with the copy mechanism, which can effectively extract overlapping triplets. It has two variants: CopyRE one employs one decoder; CopyRE mul employs multiple decoders. CopyRE is also edge list oriented.
• GraphRel (Fu et al., 2019) is a GCN (graph convolutional networks) (Kipf and Welling, 2017) based model, where a relation-weighted GCN is utilized to learn the interaction between entities and relations. It is a two phases model: GraphRel 1p denotes 1st-phase extraction model; GraphRel 2p denotes full extraction model. GraphRel is adjacency matrices oriented.
• CopyRL (Zeng et al., 2019) combines the reinforcement learning with a seq2seq model to automatically learn the extraction order of triplets. CopyRL is edge list oriented.
• CasRel (Wei et al., 2020) is a cascade binary tagging framework, where all possible subjects are identified in the first stage, and then, for each identified subject, all possible relations and the corresponding objects are simultaneously identified by a relation-specific tagger. This work recently achieved the SoTA results. As aforementioned, CasRel is partially adjacency list oriented.
Evaluation Metrics: following previous work (Zeng et al., 2018; Wei et al., 2020), the models are compared using standard micro Precision (Prec.), Recall (Rec.), and F1-score. An extracted relational triplet (subject, relation, object) is regarded as correct only if the relation and the heads of both the subject and the object are all correct.
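The matching rule can be made concrete as a small predicate. This is an illustrative sketch assuming entities are whitespace-tokenized strings whose "head" is the first token; the function name is ours.

```python
def head_match(pred, gold):
    """Partial-match criterion: a predicted triplet counts as correct
    iff the relation matches exactly and the first token (head) of both
    the subject and the object matches the gold triplet."""
    (ps, pr, po), (gs, gr, go) = pred, gold
    return (pr == gr
            and ps.split()[0] == gs.split()[0]
            and po.split()[0] == go.split()[0])

# A truncated subject still matches under this criterion:
print(head_match(("Barack", "nationality", "USA"),
                 ("Barack Obama", "nationality", "USA")))
```

The appendix's exact-match evaluation replaces the two head comparisons with full-span equality, which is strictly stricter.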
Implementation Details.
The hyperparameters are determined on the validation set. To avoid the evaluation bias, all reported results from our method are averaged results for 5 runs. More implementation details are described in Appendix section 6.1.

Results and Analysis
Relational Triplet Extraction Performance. The task performances on the two datasets are summarized in Table 2. Based on the experimental results, we have the following observations and discussions:
• The proposed DIRECT model outperformed all baseline models in terms of all evaluation metrics on both datasets, which proves that the DIRECT model can effectively address the relational triplet extraction task.
• The best-performed model (DIRECT) and runner-up model (CasRel) were both adjacency list oriented model. These two models overwhelmingly outperformed other models, which indicated the considerable potential of adjacency list (as the output data structure) for improving the task performance.
• To further compare the relation extraction ability of DIRECT and CasRel, we took a closer look at the extraction performance on relational triplet elements from these two models. As shown in Table 3, DIRECT outperformed CasRel in terms of all relational triplet elements on both datasets. These empirical results suggest that, for relational triplet extraction, a fully adjacency list oriented model (DIRECT) may have advantages over a partially oriented one (CasRel).
Ability in Handling the Overlapping Problem. The relational facts in sentences are often complicated, and different relational triplets may overlap in a sentence. To verify the ability of our model in handling the overlapping problem, we conducted further experiments on the NYT dataset. Figure 3 illustrates the F1 scores of extracting relational triplets from sentences with different overlapping patterns. DIRECT outperformed all baseline models on all overlapping patterns. These results demonstrate the effectiveness of the proposed model in solving the overlapping problem.
Ability in Handling Multiple Relation Extraction. We further compared the models' ability to extract relations from sentences that contain multiple triplets. The sentences in NYT and WebNLG were divided into 5 categories, each containing sentences with 1, 2, 3, 4, or ≥ 5 triplets. The triplet number is denoted as N. As shown in Table 4:
• DIRECT achieved the best performance for all triplet categories on both datasets. These experimental results demonstrate that our model has an excellent ability in handling multiple relation extraction.
• In both NYT and WebNLG datasets, when the sentences contained more triplets, the leading advantage of DIRECT became greater. This observation indicated that DIRECT was good at solving complex relational fact extraction.

Ablation Study
To validate the effectiveness of the components in DIRECT, we implemented several model variants for ablation tests. The comparison results on the NYT dataset are shown in Table 5. In particular, we aim to address the following two research questions:
RQ1: Is it possible to improve the model performance by sharing the parameters of the extraction layers?
RQ2: Did the proposed adaptive multi-task learning strategy improve the task performance?
Effects of Sharing Extraction Layer Parameters (RQ1). As described in Section 2, the structures of the subject extraction and object extraction output layers are exactly the same. To answer RQ1, we merged the subject extraction and object extraction layers into one entity extraction layer by sharing the parameters of the output layers of these two modules, denoted as DIRECT shared. From the results in Table 5, we can observe that sharing the parameters of the output layers of the two extraction modules reduces the performance of the model. A possible explanation is that, although the outputs of these two modules are similar, the semantics of subjects and objects are different. Hence, directly sharing the output parameters of the two modules can lead to unsatisfactory performance.
Effects of Adaptive Multi-task Learning (RQ2). As described in Section 2, the adaptive multi-task learning strategy with the dynamic sub-task loss balancing approach is proposed to improve the task performance. To answer RQ2, we replaced the adaptive multi-task learning strategy with an ordinary learning strategy, in which the losses of the three sub-tasks are computed with equal weights, denoted as DIRECT equal. From the results in Table 5, we can observe that, by using adaptive multi-task learning, DIRECT obtains a 1.5 percentage point improvement in F1-score. This significant improvement indicates that adaptive multi-task learning plays a positive role in balancing sub-task learning and can improve the global task performance.

Graph Representation Efficiency Analysis
Based on the amount estimation of predicted logits (the numeric outputs (0/1) of the last layer), we conduct a graph representation efficiency analysis to demonstrate the efficiency of the proposed method. For each graph representation category, we choose one representative algorithm. Edge List: CopyRE (Zeng et al., 2018); Adjacency Matrices: MHS (Bekoulis et al., 2018); Adjacency List: CasRel (partially) (Wei et al., 2020) and the proposed DIRECT (fully).
The averaged predicted logits estimations for one sample of the different models on the two datasets are shown in Table 6. MHS is adjacency matrices oriented, so it has the most logits to predict. Since CasRel is partially adjacency list oriented, it needs to predict more logits than DIRECT. Theoretically, as an edge list oriented model, CopyRE should predict the fewest logits. But, as described in Section 1, it needs to extract entities repeatedly to handle the overlapping problem; hence, its graph representation efficiency can be worse than that of our model. The structure of our model is simple and fully adjacency list oriented. Therefore, from the viewpoint of predicted logits estimation, DIRECT is the most representation-efficient model.

Related Work
Relational Fact Extraction. In this work, we show that all relational fact extraction models can be unified under a graph-oriented output structure analytical framework. From the perspective of graph representation, prior models can be divided into three categories. Edge List: this type of model usually follows a sequence-to-sequence fashion, such as NovelTagging (Zheng et al., 2017), CopyRE (Zeng et al., 2018), CopyRL (Zeng et al., 2019), and PNDec (Nayak and Ng, 2020). Some models in this category may suffer from the triplet overlapping problem and expensive extraction cost. Adjacency Matrices: many early pipeline approaches (Zelenko et al., 2003; Zhou et al., 2005; Mintz et al., 2009) and recent neural network-based models (Bekoulis et al., 2018; Dai et al., 2019; Fu et al., 2019) can be classified into this category. The main problem for this type of model is graph representation efficiency. Adjacency List: the recent state-of-the-art model CasRel (Wei et al., 2020) is a partially adjacency list oriented model. In this work, we propose DIRECT, a fully adjacency list oriented relational fact extraction model. To the best of our knowledge, few previous works analyze this task from the output data structure perspective. GraphRel (Fu et al., 2019) employs a graph-based approach, but from an encoding perspective, while we analyze the task from the perspective of output structure. Our work is a pioneering investigation into the output data structure of relational fact extraction.
Multi-task Learning. Multi-task Learning (MTL) can improve model performance. Caruana (1997) summarizes the goal succinctly: "it improves generalization by leveraging the domain-specific information contained in the training signals of related tasks." It has two benefits (Vandenhende et al.): (1) multiple tasks share a single model, which can save memory; (2) associated tasks complement and constrain each other by sharing information, which can reduce overfitting and improve global performance. There are two main types of MTL: hard parameter sharing (Baxter, 1997) and soft parameter sharing (Duong et al., 2015). Most multi-task learning is done by summing the losses directly; this approach is not suitable for our case, since when the inputs and outputs differ, it is impossible to obtain two losses in one forward propagation. MT-DNN (Liu et al., 2019b) is proposed for this problem. Furthermore, MTL is difficult to train: the magnitudes of different task losses differ, and directly summing the losses may bias training toward a particular task. Several studies have been proposed to address this problem (Chen et al., 2018; Guo et al., 2018; Liu et al., 2019a). They all try to dynamically adjust the weight of each loss according to the loss magnitude, the difficulty of the problem, the speed of learning, etc. In this study, we adopt MT-DNN's framework and propose an adaptive multi-task learning strategy that dynamically adjusts the loss weights based on an EMA (Lawrance and Lewis, 1977) of the task losses, penalized by the training data amount of each task.

Conclusion
In this paper, we introduce a new analytical perspective to organize relational fact extraction models and propose the DIRECT model for this task. Unlike existing methods, DIRECT is fully adjacency list oriented and employs a novel adaptive multi-task learning strategy with dynamic sub-task loss balancing. Extensive experiments on two public datasets prove the efficiency and efficacy of the proposed method.

• DIRECT shared, we merged the subject extraction and object extraction layers into one entity extraction layer by sharing the parameters of the output layers of these two modules.
• DIRECT equal, we replaced the adaptive multi-task learning strategy with an ordinary learning strategy, in which the losses of the three sub-tasks were computed with equal weights.
• DIRECT threshold , we simply recognized all the start and end positions of entities by checking if the probability p t i,start/end > α, where α was the threshold of extraction.
• DIRECT adam, we used ordinary Adam as the optimizer.

From the results of Table 8, we can observe that:

1. Sharing the parameters of output layers of subject and object extraction modules would reduce the performance of the model.
2. Compared to ordinary multi-task learning strategy, by using adaptive multi-task learning, DIRECT was able to get a 1.5 percentage point improvement on F1-score.
3. There would be a slight drop in performance, if we just used a simple threshold policy to recognize the start and end positions of an entity.
4. Despite the differences in precision and recall, there was no significant difference between the two optimizers (ordinary Adam & lazy Adam) for the task.
Results on Extracting Elements of Relational Triplets. The complete extraction performance on relational triplet elements from DIRECT and CasRel is listed in Table 9. DIRECT outperformed CasRel in terms of all relational triplet elements on both datasets. These empirical results suggest that, for relational triplet extraction, a fully adjacency list oriented model (DIRECT) may have advantages over a partially oriented one (CasRel).
Results of Different Methods under Exact-Match Metrics. In the experiment section, we followed the match metric of Zeng et al. (2018), which only requires matching the first token of an entity span. Many previous works adopted this match metric (Fu et al., 2019; Zeng et al., 2019; Wei et al., 2020).
In fact, our model is capable of extracting complete entities. Therefore, we collected papers that report results under exact-match metrics (requiring the complete entity span to match). The following strong state-of-the-art (SoTA) models are compared:
• CopyMTL is a multi-task learning framework, where a conditional random field is used to identify entities, and a seq2seq model is adopted to extract relational triplets.
• WDec (Nayak and Ng, 2020) fuses a seq2seq model with a new representation scheme, which enables the decoder to generate one word at a time and can handle full entity names of different lengths and overlapping entities.
• PNDec (Nayak and Ng, 2020) is a modification of the seq2seq model. Pointer networks are used in the decoding framework to identify the entities in the sentence by their start and end locations.
• Seq2UMTree (Zhang et al., 2020) is a modification of the seq2seq model, which employs an unordered-multi-tree decoder to minimize exposure bias.
The task performances on the NYT dataset are summarized in Table 10. The proposed DIRECT model outperformed all baseline models in terms of all evaluation metrics. These experimental results further confirm the efficacy of DIRECT for the relational fact extraction task.

Complexity Analysis of Graph Representations
For a graph G = (V, E), |V| denotes the number of nodes/entities and |E| denotes the number of edges/relations. Suppose there are m kinds of relations, and d(v) denotes the number of edges from node v.
• Edge List Complexity
− Find all edges/relations from a node: O(|E|)
• Adjacency Matrices Complexity
− Find all edges/relations from a node: O(|V| · m)
• Adjacency List Complexity
− Find all edges/relations from a node: O(d(v))

DIRECT | 2l + 2sl + or | 238 | 542
Table 11: Graph representation efficiency based on the theoretical logits amount and the estimated logits amount on two benchmark datasets.
Formally, for a sentence of length l (l tokens), there are r types of relations, and k denotes the number of triplets. Suppose there are s keys (subjects) and o values (object-based lists, one per subject-object pair) in the adjacency list. The theoretical logits amounts and the estimated logits amounts on the two benchmark datasets (NYT and WebNLG) are shown in Table 11. From the viewpoint of predicted logits estimation, DIRECT is the most representation-efficient model.
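The logits counts can be checked with simple arithmetic. The sketch below compares the dense adjacency-matrix output space (l · l · r, as in Section 1) against DIRECT's 2l + 2sl + or from Table 11; the function names and the example values (l = 40 tokens, r = 24 NYT relations, s = o = 2) are illustrative assumptions.

```python
def logits_adjacency_matrices(l, r):
    # Dense output space: one logit per (token, token, relation) cell.
    return l * l * r

def logits_direct(l, r, s, o):
    # DIRECT, per Table 11: 2l start/end logits for subject extraction,
    # 2l per extracted subject for object extraction (2*s*l), and r
    # relation logits per subject-object pair (o*r).
    return 2 * l + 2 * s * l + o * r

# Toy comparison for a 40-token sentence with 24 relation types,
# 2 subjects, and 2 subject-object pairs.
print(logits_adjacency_matrices(40, 24))  # 38400
print(logits_direct(40, 24, 2, 2))        # 288
```

Because s and o are typically tiny compared to l and r, the adjacency-list count grows roughly linearly in sentence length, while the matrix count grows quadratically in l and linearly in r.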