Dependency-driven Relation Extraction with Attentive Graph Convolutional Networks

Syntactic information, especially dependency trees, has been widely used by existing studies to improve relation extraction, as it provides semantic guidance for analyzing the context associated with the given entities. However, most existing studies suffer from noise in the dependency trees, especially when they are automatically generated, so that intensively leveraging dependency information may introduce confusion to relation classification, and careful pruning is of great importance to this task. In this paper, we propose a dependency-driven approach for relation extraction with attentive graph convolutional networks (A-GCN). In this approach, an attention mechanism over graph convolutional networks is applied to different contextual words in the dependency tree obtained from an off-the-shelf dependency parser, so as to distinguish the importance of different word dependencies. Considering that dependency types among words also carry important contextual guidance that is potentially helpful for relation extraction, we also include type information in A-GCN modeling. Experimental results on two English benchmark datasets demonstrate the effectiveness of our A-GCN, which outperforms previous studies and achieves state-of-the-art performance on both datasets.


Introduction
Relation extraction (RE), which aims to detect the relationship between entity mentions from raw text, is one of the most important tasks in information extraction and retrieval, and plays a crucial role in supporting many downstream natural language processing (NLP) applications such as text mining (Distiawan et al., 2019), sentiment analysis (Sun

* Equal contribution. † Corresponding author.
1 The code and models involved in this paper are released at https://github.com/cuhksz-nlp/RE-AGCN.

Figure 1: An illustration of noise in the dependency tree that can hurt relation extraction, where the word dependency between "pumpkin mixture" and "bowl" (whose relation is content-container) may introduce confusion to the system when the objective is to predict the relation between "milk" and "pumpkin mixture" (whose relation is entity-destination).
Recently, neural RE methods (Zeng et al., 2014; Zhang and Wang, 2015; Xu et al., 2015; dos Santos et al., 2015; Wang et al., 2016; Zhou et al., 2016; Zhang et al., 2017) with powerful encoders (such as CNN, RNN, and Transformers) have significantly improved model performance for RE without requiring any elaborately designed systems or manually constructed features. These methods are superior in capturing contextual information and thus enable RE systems to better understand the text and identify relations between entities in the given text. Adopting neural models for RE is not only straightforward and effective, but is also expected to incorporate more diverse and informative knowledge into RE systems. Among different knowledge sources, syntactic information, especially dependency trees, has been demonstrated to be beneficial in many studies (Miwa and Bansal, 2016; Zhang et al., 2018; Sun et al., 2020; Chen et al., 2021), because it provides long-distance connections between useful words and accordingly guides the system to better extract relations between entity pairs. However, intensively leveraging dependency information may not always lead to good RE performance, because the noise in the dependency tree can introduce confusion to relation classification (Xu et al., 2015; Yu et al., 2020), especially when those trees are automatically generated. For example, Figure 1 shows an example sentence with its dependency tree, where the dependency connection between "pumpkin mixture" and "bowl" may introduce noise when the objective is to predict the relation between "milk" and "pumpkin mixture". Therefore, previous studies have typically required pruning strategies before encoding the dependency information through a particular model such as an LSTM (Xu et al., 2015) or graph convolutional networks (GCN) (Zhang et al., 2018).
Because fixed pruning strategies are not guaranteed to produce a sub-tree that includes all important contextual information and filters out all noise, it is necessary to design an appropriate way to distinguish the noise in the dependency tree and model it accordingly.
In this paper, we propose a dependency-driven neural approach for RE, where an attentive graph convolutional network (A-GCN) is proposed to distinguish the important contextual information for this task. Furthermore, given that the dependency types (e.g., nominal subject) associated with dependency connections are also potentially useful for RE, since they carry syntactic information about the connected words, we further improve A-GCN by introducing type information into it. Specifically, we first obtain the dependency tree of an input sentence from an off-the-shelf toolkit, then build the graph over the tree and assign different weights to the labeled dependency connections between any two words, with the weights computed from the connections and their dependency types, and lastly predict relations with A-GCN according to the learned weights. In doing so, not only is A-GCN able to distinguish important contextual information in dependency trees and leverage it accordingly, so that reliance on pruning strategies becomes unnecessary, but it can also leverage the dependency type information that is omitted by most previous studies (in particular, studies that also use an attention mechanism (Guo et al., 2019)). Experimental results on two English benchmark datasets, i.e., ACE2005EN and SemEval 2010 Task 8, demonstrate the effectiveness of our approach to RE through A-GCN equipped with dependency type information. State-of-the-art performance is observed on both datasets.

The Proposed Approach
RE is conventionally performed as a typical classification task. Our approach follows this paradigm by using A-GCN and incorporates dependency information to improve model performance; the overall architecture of our model is illustrated in Figure 2. Specifically, given an unstructured input sentence $\mathcal{X} = x_1, \cdots, x_n$ with $n$ words, and with $E_1$ and $E_2$ denoting two entities in $\mathcal{X}$, our approach predicts the relation $\hat{r}$ between $E_1$ and $E_2$ by

$$\hat{r} = \arg\max_{r \in \mathcal{R}} p\big(r \mid \text{A-GCN}(\mathcal{X}, \mathcal{T_X})\big) \quad (1)$$

where $\mathcal{T_X}$ is the dependency tree of $\mathcal{X}$ obtained from an off-the-shelf toolkit, $\mathcal{R}$ is the relation type set, and $p$ computes the probability of a particular relation $r \in \mathcal{R}$ given the two entities and the output of A-GCN, which takes $\mathcal{X}$ and $\mathcal{T_X}$ as input. The following text starts with a brief introduction of the standard GCN model, then elaborates our proposed A-GCN equipped with dependency type information, and lastly illustrates the process of applying A-GCN to the classification paradigm for RE.

Standard Graph Convolutional Networks
Generally, a good text representation is a prerequisite for outstanding model performance (Song et al., 2017; Bojanowski et al., 2017; Hajdik et al., 2019). To enhance the text representation and thus obtain a good understanding of the running text, many studies (Song et al., 2009, 2012; Song and Xia, 2013; Xu et al., 2015; Miwa and Bansal, 2016; Mandya et al., 2020; Nie et al., 2020) tried to leverage contextual features, such as n-grams and syntactic information, through different model architectures. Among these architecture choices, graph convolutional networks (GCN) are a widely used architecture to encode the information in a graph, where in each GCN layer the information at each node is communicated to its neighbors through the connections between them. The effectiveness of GCN models in encoding the contextual information over a graph of an input sentence has been demonstrated by many previous studies (Zhang et al., 2018; Guo et al., 2019; Sun et al., 2020; Chen et al., 2020; Yu et al., 2020; Mandya et al., 2020; Tian et al., 2020c, 2021a). Normally, the graph in the standard GCN model is built from word dependencies and is represented by an adjacency matrix $A = (a_{i,j})_{n \times n}$, where $a_{i,j} = 1$ if $i = j$ or there is a dependency connection (arc) between the two words $x_i$ and $x_j$ in the dependency tree $\mathcal{T_X}$, and $a_{i,j} = 0$ otherwise. Based on $A$, for each word $x_i \in \mathcal{X}$, the $l$-th GCN layer gathers the information carried by its context words in $\mathcal{T_X}$ and computes the output representation $h_i^{(l)}$ for $x_i$ by

$$h_i^{(l)} = \sigma\Big(\sum_{j=1}^{n} a_{i,j}\big(W^{(l)} h_j^{(l-1)} + b^{(l)}\big)\Big) \quad (2)$$

where $h_j^{(l-1)}$ denotes the output representation of $x_j$ from the $(l-1)$-th GCN layer ($h_j^{(0)}$ is the output of the encoder for $x_j$), $W^{(l)}$ and $b^{(l)}$ are the trainable matrix and the bias for the $l$-th GCN layer, respectively, and $\sigma$ is the ReLU activation.

Figure 2: The overall architecture of the proposed A-GCN for RE, illustrated with an example input sentence (the two entities "defamation" and "bishop" are highlighted in blue and red, respectively) and its dependency tree. The left part shows our A-GCN model, where the attention weights are applied to different connections to model the dependency type-aware contextual information. The right part illustrates the adjacency matrix $A$ for the dependency graph and the process of computing the attention weights (i.e., $p_{i,j}^{(l)}$) for different connections.
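To make Eq. (2) concrete, here is a minimal pure-Python sketch (no deep-learning library) that builds the adjacency matrix A from (head, dependent) index pairs and applies one GCN layer; the input format, dimensions, and values are toy assumptions rather than the paper's configuration.

```python
def build_adjacency(n, edges):
    """a[i][j] = 1 if i == j or words i and j are connected in the tree."""
    a = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for head, dep in edges:
        a[head][dep] = a[dep][head] = 1  # connections are used undirected
    return a

def relu(v):
    return [max(0.0, x) for x in v]

def gcn_layer(a, h, w, b):
    """Eq. (2): h_i = ReLU(sum_j a_ij (W h_j + b)), with plain list math."""
    out = []
    for i in range(len(a)):
        acc = [0.0] * len(b)
        for j in range(len(a)):
            if a[i][j]:
                wh = [sum(wr * x for wr, x in zip(row, h[j])) for row in w]
                acc = [s + v + bb for s, v, bb in zip(acc, wh, b)]
        out.append(relu(acc))
    return out

# Toy sentence "he drank milk": the head word "drank" (index 1)
# governs both other words; 1-dimensional hidden vectors.
adj = build_adjacency(3, [(1, 0), (1, 2)])
hidden = gcn_layer(adj, [[1.0], [2.0], [3.0]], [[1.0]], [0.0])
```

With the identity weight and zero bias, each word's new representation is simply the sum of its neighbors' (and its own) values, which is the aggregation Eq. (2) describes.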

A-GCN with Dependency Type
It is noted that in the standard GCN (e.g., Eq. (2)), the connections among words are treated equally (i.e., $a_{i,j}$ is either 0 or 1). Therefore, GCN-based models for RE are not able to distinguish the importance of different connections, and pruning them is of great importance for RE. We thus propose A-GCN for this task, which uses an attention mechanism to compute weights for different connections so that the model is able to leverage different dependency connections accordingly. In addition, the standard GCN and most previous studies omit the dependency types associated with the dependency connections, even though those types contain highly useful information for RE; they are introduced into A-GCN in this work. Specifically, we first represent the dependency types in $\mathcal{T_X}$ by a type matrix $T = (t_{i,j})_{n \times n}$, where $t_{i,j}$ is the dependency type (e.g., nsubj) associated with the directed dependency connection between $x_i$ and $x_j$. Next, we map each type $t_{i,j}$ to its embedding $e^t_{i,j}$. Then, at the $l$-th GCN layer, the weight $p_{i,j}^{(l)}$ for the connection between $x_i$ and $x_j$ is computed by

$$p_{i,j}^{(l)} = \frac{a_{i,j} \cdot \exp\big(s_i^{(l)} \cdot s_j^{(l)}\big)}{\sum_{j'=1}^{n} a_{i,j'} \cdot \exp\big(s_i^{(l)} \cdot s_{j'}^{(l)}\big)} \quad (3)$$

where $a_{i,j} \in A$, "$\cdot$" denotes the inner product, and $s_i^{(l)} = h_i^{(l-1)} \oplus e^t_{i,j}$ and $s_j^{(l)} = h_j^{(l-1)} \oplus e^t_{i,j}$, with "$\oplus$" denoting the vector concatenation operation. Afterwards, we apply the weight $p_{i,j}^{(l)}$ to the associated dependency connection between $x_i$ and $x_j$ and obtain the output representation of $x_i$ by

$$h_i^{(l)} = \sigma\Big(\sum_{j=1}^{n} p_{i,j}^{(l)}\big(W^{(l)} \widetilde{h}_j^{(l-1)} + b^{(l)}\big)\Big) \quad (4)$$

with $\sigma$, $W^{(l)}$, and $b^{(l)}$ following the same notations as in Eq. (2) for the standard GCN, and

$$\widetilde{h}_j^{(l-1)} = h_j^{(l-1)} + W_T^{(l)} \cdot e^t_{i,j} \quad (5)$$

where $W_T^{(l)}$ maps the dependency type embedding $e^t_{i,j}$ to the same dimension as $h_j^{(l-1)}$. Compared with the standard GCN (i.e., Eq. (2)), our approach uses a numerical weight (i.e., $p_{i,j}^{(l)} \in [0,1]$) rather than a binary choice of $a_{i,j}$ to distinguish the importance of different connections and leverage them accordingly. In addition, we integrate the dependency type information into both the computed weight (i.e., $p_{i,j}^{(l)}$) and the output representation of $x_i$ (i.e., $h_i^{(l)}$), which is not considered in most previous studies.
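The normalization in Eq. (3) behaves like a softmax masked by the adjacency matrix: each word distributes a total weight of 1 over its existing connections. The sketch below illustrates just that step, taking the s vectors as given toy inputs instead of building them from hidden states and type embeddings as described above.

```python
import math

def attention_weights(a, s):
    """p[i][j] = a_ij * exp(s_i . s_j) / sum_j' a_ij' * exp(s_i . s_j')."""
    n = len(a)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # mask scores by the adjacency row, then normalize over connections
        scores = [a[i][j] * math.exp(sum(x * y for x, y in zip(s[i], s[j])))
                  for j in range(n)]
        z = sum(scores)
        for j in range(n):
            p[i][j] = scores[j] / z if z else 0.0
    return p

# Two fully connected words with identical feature vectors share weight equally.
weights = attention_weights([[1, 1], [1, 1]], [[0.0], [0.0]])
```

Because of the adjacency mask, words that are not connected in the dependency tree always receive weight 0, while the connected ones compete through the softmax.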

Relation Extraction with A-GCN
Before applying A-GCN to RE, we first encode the input $\mathcal{X}$ into hidden vectors by BERT (Devlin et al., 2019), with $h_i^{(0)}$ denoting the hidden vector for $x_i$, where the hidden vector (denoted $h_{\mathcal{X}}$) for the special sentence-initial token "[CLS]" is used as the representation of the entire sentence. Next, we feed the $h_i^{(0)}$ into A-GCN and obtain the output representation $h_i^{(L)}$ for each word from the last ($L$-th) A-GCN layer. Then, we apply max pooling to the output hidden vectors of the words that belong to an entity mention (i.e., $E_k$, $k = 1, 2$) to compute the representation of that entity (denoted $h_{E_k}$) by

$$h_{E_k} = \text{MaxPooling}\big(\{h_i^{(L)} \mid x_i \in E_k\}\big) \quad (6)$$

Afterwards, we concatenate the representations of the sentence (i.e., $h_{\mathcal{X}}$) and the two entities (i.e., $h_{E_1}$ and $h_{E_2}$) and apply a trainable matrix $W_R$ to map the resulting vector to the output space by

$$o = W_R \cdot (h_{\mathcal{X}} \oplus h_{E_1} \oplus h_{E_2}) \quad (7)$$

where $o$ is a $|\mathcal{R}|$-dimensional vector, with each of its values referring to a relation type in the relation type set $\mathcal{R}$. Finally, we apply a softmax to $o$ to predict the relation $\hat{r}$ between $E_1$ and $E_2$ by

$$\hat{r} = \arg\max_{u} \frac{\exp(o_u)}{\sum_{u'} \exp(o_{u'})} \quad (8)$$

with $o_u$ representing the value at dimension $u$ of $o$.
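The output side described above (max pooling over entity tokens, concatenation, a linear map, and a softmax argmax) can be sketched as follows; the dimensions, values, and helper names are illustrative assumptions.

```python
import math

def max_pool(vectors):
    """Element-wise max over a list of equal-length vectors."""
    return [max(col) for col in zip(*vectors)]

def classify(h_sent, h_e1, h_e2, w_r):
    """Concatenate sentence and entity vectors, project, and take argmax."""
    feat = h_sent + h_e1 + h_e2            # vector concatenation
    o = [sum(w * x for w, x in zip(row, feat)) for row in w_r]
    z = sum(math.exp(v) for v in o)
    probs = [math.exp(v) / z for v in o]   # softmax over relation types
    return max(range(len(o)), key=probs.__getitem__)

# Toy example: 1-dimensional sentence/entity vectors, two relation types.
pooled = max_pool([[1.0, 5.0], [3.0, 2.0]])
pred = classify([1.0], [0.5], [2.0], [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
```

Since argmax is invariant under the (monotone) softmax, the probabilities are only needed at training time for the loss; they are kept here to mirror Eq. (8).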

Datasets
In the experiments, we use two English benchmark datasets for RE, namely ACE2005EN (ACE05) and SemEval 2010 Task 8 (SemEval) (Hendrickx et al., 2010). For ACE05, we use its English section and follow previous studies (Miwa and Bansal, 2016; Christopoulou et al., 2018; Ye et al., 2019) to pre-process it (two small subsets, cts and un, are removed) and split the documents into training, development, and test sets. For SemEval, we use its official train/test split. The numbers of unique relation types in ACE05 and SemEval are 7 and 19, respectively. We report the number of instances (i.e., entity pairs) in the train/dev/test sets of the ACE05 and SemEval benchmark datasets in Table 1.

Dependency Graph Construction
To construct graphs for A-GCN, we use the Stanford CoreNLP Toolkit (SCT) to obtain the dependency tree $\mathcal{T_X}$ for each input sentence $\mathcal{X}$. Although our approach is able to distinguish the importance of different dependency connections through the attention mechanism, it is still beneficial to filter out, through particular pruning strategies, those dependency connections that bring confusion to RE. Motivated by previous studies (Xu et al., 2015; Zhang et al., 2018; Yu et al., 2020), in this paper we construct the graph for A-GCN from two groups of dependency connections, namely the local connections and the global connections. In detail, local connections include all dependencies that directly connect to the heads of the two entities, and global connections include all dependencies along the shortest dependency path (SDP) between the heads of the two entities, where in many cases words that are not directly connected to the two entities are also involved. With an example sentence including two entities (i.e., "company" and "benchmarking"), Figure 3 illustrates the two groups of dependency connections and the resulting adjacency matrix, which is built with the connections from the two groups. It is worth noting that, when the SDP is short, there might be more connections in the local group than in the global one.

Figure 3: An illustration of the two (i.e., local and global) groups of dependency connections for an example sentence (entities are highlighted in red) with the adjacency matrix (on the right) built upon all connections from the two groups. Local and global connections are represented in orange and blue, respectively.
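Under an assumed edge-list input format, the "L + G" graph construction can be sketched as follows: local connections are edges touching an entity head word, and global connections are edges on the SDP between the two heads, found here with BFS over the tree treated as an undirected graph. Function names and the index-based interface are illustrative.

```python
from collections import deque

def shortest_path(n, edges, src, dst):
    """BFS path between two nodes; assumes the dependency tree is connected."""
    nbr = {i: [] for i in range(n)}
    for h, d in edges:
        nbr[h].append(d)
        nbr[d].append(h)
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            break
        for v in nbr[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]

def build_pruned_adjacency(n, edges, head1, head2):
    keep = set()
    for h, d in edges:                       # local: edges touching a head
        if h in (head1, head2) or d in (head1, head2):
            keep.add((h, d))
    sdp = shortest_path(n, edges, head1, head2)
    on_path = set(zip(sdp, sdp[1:]))         # global: edges along the SDP
    for h, d in edges:
        if (h, d) in on_path or (d, h) in on_path:
            keep.add((h, d))
    a = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for h, d in keep:
        a[h][d] = a[d][h] = 1
    return a

# Toy tree 0-1-2-3 with an extra branch 2-4; entity heads are words 0 and 3,
# so the branch edge (2, 4) is neither local nor on the SDP and is pruned.
adj = build_pruned_adjacency(5, [(0, 1), (1, 2), (2, 3), (2, 4)], 0, 3)
```

Note how the two groups can overlap (edge (2, 3) is both local and on the SDP), which matches the observation in the analysis section that local connections partially cover the SDP.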

Implementation
Following Soares et al. (2019), we insert four special tokens (i.e., "<e1>", "</e1>", "<e2>", and "</e2>") into the input sentence to mark the boundaries of the two entities under investigation (for example, "<e1>" and "</e1>" are inserted right before and after the entity E1 in the input X, respectively); this allows the encoder to distinguish the positions of the entities during encoding and thus improves model performance. For the encoder, we use BERT (Devlin et al., 2019), a powerful pre-trained language model whose variants have achieved state-of-the-art performance in many NLP tasks (Wu and He, 2019; Soares et al., 2019; Diao et al., 2020; Antoun et al., 2020; Tian et al., 2020a,b,d, 2021b; Qin et al., 2021). Specifically, we use the uncased versions of BERT-base and BERT-large with their default settings (e.g., for BERT-base, 12 layers of multi-head attention with 768-dimensional hidden vectors; for BERT-large, 24 layers of multi-head attention with 1024-dimensional hidden vectors). Note that we do not distinguish the two groups of connections (local vs. global) in A-GCN once they are represented by the adjacency matrix. For A-GCN, we randomly initialize all trainable parameters and the dependency type embeddings. For evaluation, we follow previous studies in using the standard micro-F1 scores for ACE05 and the macro-averaged F1 scores for SemEval. In our experiments, we try different combinations of hyper-parameters and tune them on the dev set, then evaluate on the test set with the model that achieves the highest F1 score on the dev set.
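The entity-marker preprocessing can be sketched as below; the function name and the span interface (token index ranges, end exclusive) are illustrative assumptions.

```python
def insert_markers(tokens, span1, span2):
    """Wrap two non-overlapping entity spans with <e1>...</e1> and <e2>...</e2>.

    span = (start, end) token indices with end exclusive.
    """
    marks = [(span1[0], "<e1>"), (span1[1], "</e1>"),
             (span2[0], "<e2>"), (span2[1], "</e2>")]
    out = list(tokens)
    # insert from the rightmost position so earlier indices stay valid
    for pos, tag in sorted(marks, reverse=True):
        out.insert(pos, tag)
    return out

marked = insert_markers(["he", "poured", "milk", "into", "the", "bowl"],
                        (2, 3), (5, 6))
```

The markers are added to the tokenizer's vocabulary so that BERT sees them as single tokens; inserting right-to-left keeps the remaining span indices valid without any offset bookkeeping.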

Overall Results
In the experiments, we run our A-GCN models with the BERT-base and BERT-large encoders on graphs with and without dependency pruning, which correspond to the graph built upon the combined local and global connections ("L + G") and the full dependency graph ("Full"), respectively. We also run baselines with the standard GCN and standard graph attention networks (GAT) (Veličković et al., 2017) on the same graphs. For both the standard GCN and A-GCN, we try different numbers of layers (i.e., 1 to 3 layers). In addition, we try BERT-base and BERT-large baselines that do not use any dependency information.

Table 2: F1 scores of our A-GCN models and the baselines (i.e., BERT-only, standard GAT, and standard GCN) under different settings with BERT-base (a) and BERT-large (b). All graph-based models (i.e., GAT, GCN, and A-GCN) are tested with two settings: the first uses the full graph (Full) with all dependency connections involved, and the second uses the combination of local and global connections (L + G). We also run GCN and A-GCN with different numbers of layers (i.e., 1 to 3 layers) for fair comparisons.

Table 2 reports the results of our A-GCN models and all the aforementioned baselines on the test sets of ACE05 and SemEval. There are several observations. First, A-GCN functions well with either BERT-base or BERT-large as the encoder, with consistent improvements observed over the BERT-only baselines (ID: 1) across the two benchmark datasets, even though the BERT baselines already achieve good performance. Second, for both datasets, A-GCN outperforms the GAT (ID: 2, 3) and standard GCN baselines (ID: 4, 6, 8, 10, 12, 14) with the same graph (i.e., either "L + G" or "Full") and an equal number of layers.
Particularly, when the full dependency graph is used, it is noted that in some cases (e.g., ID: 8 for BERT-base on ACE05) the standard GCN obtains very limited improvements (or even worse results) over the BERT-only baseline (ID: 1), whereas our A-GCN models (e.g., ID: 9 for BERT-base) are able to consistently outperform the BERT-only baseline and achieve higher performance. We attribute this observation to the attention mechanism used to weight different dependency connections, which allows A-GCN to distinguish the noise in the graph and thus leverage useful dependency information accordingly. Third, among the models with different numbers of A-GCN layers, the ones with two A-GCN layers (e.g., ID: 11 for BERT-base and ID: 11 for BERT-large) achieve the highest scores, and a similar trend is observed for the standard GCN baselines. Besides, we find that our A-GCN models (as well as the standard GCN baselines) with the local and global connections (i.e., "L + G") consistently outperform the ones with the full dependency graph (i.e., "Full"). These observations are relatively intuitive, since dependency information may introduce more noise into RE when it is leveraged in an intensive way (e.g., by using more layers or the full dependency tree without pruning). (For the same groups of models, we report the F1 scores on the development sets in Appendix C and the mean and standard deviation of their test set results in Appendix D.)

Comparison with Previous Studies
In addition, we compare our best models (with "L + G" or "Full" graphs) using the BERT-large encoder and two A-GCN layers (ID: 9 and 11) with previous studies. The test results (F1 scores) are reported in Table 3, where our model with both local and global connections (i.e., "L + G") outperforms all previous studies and achieves state-of-the-art performance on the two benchmark datasets. Specifically, compared with Guo et al. (2019), who proposed a graph-based approach with attention to leverage dependency connections, our approach leverages both the dependency connections and the dependency types among all input words, and thus provides a more comprehensive way to leverage dependency information. In addition, although Mandya et al. (2020) proposed an approach that leverages both dependency connections and dependency types through attention, they added the dependency types directly to the input word embeddings along with POS embeddings, and the attention in their approach is a separate, stand-alone module added on top of the GCN layers. On the contrary, in our approach, the dependency types are added to each A-GCN layer and the attention mechanism is applied directly to each dependency connection within the A-GCN layer. Therefore, compared with Mandya et al. (2020), our A-GCN encodes the dependency connections and dependency types in a more integrated manner and thus can better leverage them to guide the prediction of relations between the given entities.

The Effect of A-GCN
Dependency information is supposed to be beneficial for RE because it contains long-distance word-word relations, which can be extremely useful when the two given entities are far away from each other in the input sentence. To explore the effect of A-GCN in capturing such long-distance word-word relations for RE, we split the test instances into different groups according to their entity distances (i.e., the number of words between the two entities) and run models on these groups to test their performance. Figure 4 shows the performance of our best-performing A-GCN model with BERT-large (ID: 11 in Table 2) and its corresponding standard GCN and BERT-large baselines on three groups of test instances from the test set of SemEval, where the category name indicates the range of the entity distance. It is observed that A-GCN outperforms the two baselines on all groups of test instances, and the improvement becomes larger as the entity distance increases. This observation confirms that our approach is able to leverage dependency information and capture long-distance word-word relations to improve RE.

Figure 4: Performance (F1 scores) of different models (i.e., BERT-only, two layers of standard GCN, and two layers of A-GCN) with the BERT-large encoder on three groups of test instances from SemEval, where each group is generated based on the distance (i.e., number of words) between the two entities in an instance.
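The grouping used for this analysis can be sketched as follows; the bucket boundaries are assumptions (the paper does not state them here), and per-group accuracy stands in for the F1 scores it reports.

```python
def group_by_distance(instances, boundaries=(5, 10)):
    """Bucket (e1_idx, e2_idx, correct) triples by words between the entities.

    Returns per-bucket accuracy (fraction of correct predictions), or None
    for an empty bucket.
    """
    groups = {"short": [], "medium": [], "long": []}
    for e1, e2, correct in instances:
        dist = abs(e2 - e1) - 1  # words strictly between the two entities
        if dist < boundaries[0]:
            groups["short"].append(correct)
        elif dist < boundaries[1]:
            groups["medium"].append(correct)
        else:
            groups["long"].append(correct)
    return {k: sum(v) / len(v) if v else None for k, v in groups.items()}

# Toy predictions: correct at distances 2 and 19, wrong at distance 7.
acc = group_by_distance([(0, 3, 1), (0, 8, 0), (0, 20, 1)])
```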

The Effect of Graph Construction
In the main experiments, we try A-GCN with the graph built upon the combined local and global connections ("L + G"). To explore the individual effects of the local and global connections in A-GCN, we run our approach using two A-GCN layers with the graph constructed from the local connections ("L") or the global connections ("G") alone. Table 4 presents the experimental results (F1 scores) of different models with the BERT-base and BERT-large encoders, where the results of the BERT-only baselines, A-GCN (L + G), and A-GCN (Full) are copied from Table 2 for reference. Compared to A-GCN (L + G), the models with the graph constructed from either the local connections (i.e., A-GCN (L)) or the global connections (i.e., A-GCN (G)) alone achieve lower performance, which complies with our intuition, because both groups of connections contain important contextual features for RE. Interestingly, A-GCN (L) outperforms A-GCN (G) with both the BERT-base and BERT-large encoders. A possible explanation is the following. There are overlaps between local and global connections (e.g., the connection between "range" and "restrictions" in Figure 3). Therefore, A-GCN (L) can not only leverage the contextual information associated with the entities themselves, but also partially benefit from the overlapping connections on the SDP between the two entities, which leads A-GCN (L) to higher performance than A-GCN (G).

Ablation Study
Table 4: Performance of our models with two A-GCN layers using the graphs built upon (1) ...

Compared with the standard GCN, A-GCN enhances it in two aspects: (1) using an attention mechanism to weight different dependency connections, and (2) introducing dependency types into the encoding process to capture more detailed dependency information. To investigate the effect of each individual enhancement (i.e., the attention mechanism or the dependency type information), we conduct an ablation study on our best model, i.e., two layers of A-GCN (L + G) with the BERT-base and BERT-large encoders. Table 5 reports the experimental results of the different models, where the performance of the BERT-only baseline and the standard GCN baseline (i.e., the one that uses neither the attention mechanism nor dependency types) is also reported for reference. The results clearly indicate that the ablation of either enhancement results in worse performance (compared with the full A-GCN). Between the two enhancements, ablating the attention mechanism hurts A-GCN more, which indicates that the ability to distinguish important connections and leverage them accordingly plays the more important role in RE.

Case Study
To explore in detail how A-GCN leverages dependency connections and types to improve RE, we conduct a case study with our A-GCN models with different dependency graphs (i.e., two layers of A-GCN (Full) and A-GCN (L + G) with the BERT-large encoder) on the example sentence "A central vacuum is a vacuum motor and filtration system built inside a canister.". Figure 5 shows the sentence, where both models correctly predict the relation between "motor" (E1) and "canister" (E2) (highlighted in red) to be Content-Container, whereas the baseline GCN (Full) and GCN (L + G) models fail to do so. We also visualize the attention weights assigned to different dependency connections, extracted from the last A-GCN layer, with darker and thicker lines referring to higher weights. In this example, for A-GCN (Full), we observe that the connection between "built" and "canister" along the SDP and the connection between "inside" and "canister" receive the highest weights. This is reasonable because the dependency type, i.e., obl (oblique nominal), associated with the connection between "built" and "canister" reveals that "canister" could be the place where the action (i.e., build) takes place, which is further confirmed by the dependency connection and type (i.e., case) between "inside" and "canister". This demonstrates that our model learns from the contextual information carried by such important connections, resulting in a correct RE prediction. Similarly, A-GCN (L + G) also performs RE correctly on this case by highlighting the same dependency connections as A-GCN (Full) with much higher weights (because many dependency connections are filtered out).

Table 5: The ablation study on the attention mechanism (Att.) and dependency types (Type) in our best model, i.e., two layers of A-GCN (L + G). "✓" and "×" indicate whether a module is used. The F1 scores of BERT-only and the standard two-layer GCN (L + G) are also reported for reference.

Related Work
Recently, neural networks integrating external knowledge or resources have played an important role in RE because of their superiority in capturing contextual information (Shen and Huang, 2016; Soares et al., 2019). Particularly, as one kind of such knowledge, dependency parses have shown their effectiveness in supporting RE through their ability to capture long-distance word relations (Zhang et al., 2018; Guo et al., 2019). However, intensively leveraging dependency information can introduce confusion into RE (Xu et al., 2016b; Yu et al., 2020), so that necessary pruning is required to alleviate this problem. For example, Xu et al. (2015) proposed to use the connections along the shortest dependency path between the two entities and applied an LSTM to model them; Miwa and Bansal (2016) proposed to prune the original dependency tree to the lowest common ancestor subtree. However, these pruning strategies are either too aggressive or too modest, so that the resulting graph may lose important context or retain considerable noise. Zhang et al. (2018) adopted GCN to model the dependencies and proposed a trade-off pruning strategy in between those of Xu et al. (2015) and Miwa and Bansal (2016). Besides, there are other graph-based models for RE that utilize layers of multi-head attention (Guo et al., 2019), dynamic pruning (Yu et al., 2020), and additional attention layers (Mandya et al., 2020) to encode dependency trees. Compared with the aforementioned methods, especially the graph-based ones, our approach offers an alternative that enhances RE with A-GCN through an attention mechanism and dependency types, which are effective and efficient improvements to the standard GCN that do not require a complicated model design.

Figure 5: Visualizations of the weights assigned to different dependency connections by A-GCN (Full) and A-GCN (L + G) for an example input, where darker and thicker lines refer to connections with higher weights.

Conclusion
In this paper, we propose A-GCN to leverage dependency information for relation extraction, where an attention mechanism is applied to dependency connections to weight both connections and types, so as to better distinguish the important dependency information and leverage it accordingly. In doing so, A-GCN is able to dynamically learn from different dependency connections, so that less-informative dependencies are smartly pruned. Experimental results and analyses on two English benchmark datasets for relation extraction demonstrate the effectiveness of our approach, especially for entities separated by long word sequences, with state-of-the-art performance obtained on both datasets.

Appendix B. Model Size and Running Speed

Table 7 reports the number of trainable parameters and the inference speed (sentences per second) of the baselines (i.e., BERT, BERT + GAT, and BERT + GCN) and our models (i.e., BERT + A-GCN) on the ACE2005 and SemEval datasets. All models are run on an NVIDIA Tesla V100 GPU.

Appendix C. Experimental Results on the Development Set

Appendix D. Mean and Deviation of the Results
In the experiments, we test models with different configurations. For each model, we train it with the best hyper-parameter setting using five different random seeds. We report the mean (µ) and standard deviation (σ) of the F1 scores on the test sets of ACE2005 and SemEval in Table 9.

Table 8: F1 scores of our A-GCN models and the baselines (i.e., BERT-only, standard GAT, and standard GCN) under different settings with BERT-base and BERT-large on the development set of ACE2005. All graph-based models (i.e., GAT, GCN, and A-GCN) are tested with two settings: the first uses the full graph (Full) with all dependency connections involved, and the second uses the combination of local and global connections (L + G). We also run GCN and A-GCN with different numbers of layers (i.e., 1 to 3 layers) for fair comparisons.