AutoRC: Improving BERT Based Relation Classification Models via Architecture Search

Although BERT based relation classification (RC) models have achieved significant improvements over traditional deep learning models, there seems to be no consensus on the optimal architecture, since many design choices are available. In this work, we design a comprehensive search space for BERT based RC models and employ a modified version of the efficient neural architecture search (ENAS) method to automatically discover the design choices mentioned above. Experiments on eight benchmark RC tasks show that our method is efficient and effective in finding better architectures than the baseline BERT based RC models. Ablation studies demonstrate the necessity of our search space design and the effectiveness of our search method. We also show that our framework can be applied to other entity related tasks like coreference resolution and span based named entity recognition (NER).


Introduction
The task of relation classification (RC) is to predict semantic relations between pairs of entities inside a context. It is an important NLP task since it serves as an intermediate step in a variety of NLP applications. Many works apply deep neural networks (DNNs) to relation classification (Socher et al., 2012; Zeng et al., 2014; Shen and Huang, 2016). With the rise of pre-trained language models (PLMs) (Devlin et al., 2018), a series of works has incorporated PLMs such as BERT into RC tasks (Baldini Soares et al., 2019; Wu and He, 2019; Eberts and Ulges, 2019; Peng et al., 2019), showing significant improvements over traditional DNN models.
Despite this great success, there is yet no consensus on how to represent the entity pair and their contextual sentence for a BERT based RC model. First, Baldini Soares et al. (2019) and Peng et al. (2019) use different entity identification methods. Second, Baldini Soares et al. (2019) and Wu and He (2019) use different aggregation methods for entity representations and contexts. Third, which features should be fed to the classification layer also has to be determined (Eberts and Ulges, 2019). In addition, previous literature does not consider the interactions between the feature vectors.
In this work, we experiment with making the design choices in BERT based RC models automatically, so that one can obtain an architecture that better suits the task at hand (Figure 1). Throughout this work, we refer to our framework as AutoRC, which includes our search space and search method. First, a comprehensive search space for the design choices that should be considered in a BERT based RC model is established. Second, to navigate this search space, we employ a reinforcement learning (RL) strategy following ENAS. That is, a controller generates new RC architectures, receives rewards, and updates its policy via the policy gradient method. To stabilize and improve the search results, three non-trivial modifications to ENAS are proposed: a) heterogeneous parameter sharing, which shares parameters more deeply than ENAS if the modules play similar roles and does not share them otherwise; b) maintaining multiple copies of the shared parameters, which are randomly drawn for the child models; c) search warm-ups, which generate and update child models without updating the controller at the beginning of the search stage.
Experiments on eight benchmark RC tasks show that our method can outperform the standard BERT based RC models. Transfer of the learned architectures across different tasks is also investigated, showing that a transferred architecture can outperform the baseline models but not the architecture learned on the target task itself. Ablation studies on the search space demonstrate the validity of our search space design, and experiments show that our proposed modifications to ENAS are effective. We also show that our framework works effectively on other entity related tasks like coreference resolution and span based NER.
The contributions of this paper can be summarized as follows: • We develop a comprehensive search space for improving BERT based RC models, in which the alternative input formats and aggregation layers are applicable to other tasks.
• As far as we know, we are the first to introduce NAS for BERT based models. Our proposed methods for improving search results are effective and universally applicable.

Related Work
Our work is closely related to the literature on neural architecture search (NAS). The field of NAS has attracted a lot of attention in recent years. The goal is to find automatic mechanisms for generating new neural architectures to replace conventional handcrafted ones, or to automatically decide optimal design choices instead of tuning them manually (Bergstra et al., 2011). Recently, NAS has been widely applied to computer vision tasks, such as image classification (Cai et al., 2018), semantic segmentation, object detection (Ghiasi et al., 2019), super-resolution (Ahn et al., 2018), etc. However, NAS is less well studied in the field of natural language processing (NLP), especially in information extraction (IE). Recent works (Zoph and Le, 2017) search for new recurrent cells for language modeling (LM) tasks. The evolved transformer (So et al., 2019) employs an evolution-based search algorithm to generate better transformer architectures for machine translation tasks. Zhu et al. (2021) develops a novel search space that incorporates a cross-sentence attention mechanism and is able to find novel architectures for natural language understanding (NLU) tasks. In this work, we design a method that incorporates NAS to improve BERT based relation extraction models.

Our work is also closely related to the literature on relation extraction, especially recent works that take advantage of pre-trained language models (PLMs). In terms of entity span identification, Baldini Soares et al. (2019) argues that adding entity markers to the input tokens works best, while Peng et al. (2019) shows that some RC tasks favor replacing entity mentions with special tokens. For feature selection, Baldini Soares et al. (2019) shows that aggregating the entity representations via start pooling works best across a panel of RC tasks. Meanwhile, Wu and He (2019) chooses average pooling for entity features and argues that incorporating the representation of the [CLS] token is beneficial. Eberts and Ulges (2019) shows that the context between two entities serves as a strong signal on some RC tasks. Zhu (2020) shows that pre-training with entity spans can benefit the downstream tasks. In this work, we provide a more comprehensive overview of the design choices in BERT based RC models and a solution for efficient and task-specific architecture discovery, thus relieving NLP practitioners in the field of RE from manual or simple heuristic model tuning.

Search space for RC model
An overall architecture design for an RC model is shown in Figure 1. Following its bottom-up workflow, we will define the search space for AutoRC.

Formal definition of task
In this paper, we focus on learning mappings from relation statements to relation representations. Formally, let x = [x_0, ..., x_n] be a sequence of tokens, and let entity 1 (e_1) and entity 2 (e_2) be the entity mentions, as depicted at the bottom of Figure 1. The position of e_i in x is denoted by its start and end positions, s_i = (e_i^s, e_i^e). A relation statement is a triple r = (x, e_1, e_2). Our goal is to learn a function f_θ that maps the relation statement to a fixed-length vector h_r = f_θ(r) ∈ R^d that represents the relation expressed in r.
Note that the two entities divide the sentence into five parts: e_1 and e_2 as entity mentions, and three contextual pieces, denoted as c_0, c_1 and c_2.
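To make the notation concrete, below is a minimal sketch of this data structure; the field names, and the assumption that entity 1 precedes entity 2, are ours rather than part of the original implementation.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RelationStatement:
    # r = (x, e1, e2): a token sequence plus the two entity spans
    tokens: List[str]            # x = [x_0, ..., x_n]
    span1: Tuple[int, int]       # s_1 = (start, end) of entity 1, end exclusive
    span2: Tuple[int, int]       # s_2 = (start, end) of entity 2, end exclusive

    def pieces(self):
        """Split x into the five parts c_0, e_1, c_1, e_2, c_2."""
        (s1, e1), (s2, e2) = self.span1, self.span2
        return (self.tokens[:s1],      # c_0
                self.tokens[s1:e1],    # e_1
                self.tokens[e1:s2],    # c_1
                self.tokens[s2:e2],    # e_2
                self.tokens[e2:])      # c_2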

Entity span identification
In this work, we employ BERT (Devlin et al., 2018) as the encoder for the input sentences. The BERT encoder may need to distinguish the entity mentions from the context sentence to properly model the semantic representations of a relation statement. We present three different options for getting information about the entity spans s_1 and s_2 into our BERT encoder, which are depicted in Figure 2.
standard: no change is made to the input sentence (Figure 2(a)).
entity markers: special tokens are added at the start and end of each entity to inform BERT where the two entities are in the sentence (Figure 2(b)).
entity tokens: the entity mentions in the sentence are replaced with special tokens (Figure 2(c)).
Formally, the transformed input for each option is sketched below.
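The following sketch illustrates the three formats; the special-token strings ([E1], [/E1], [ENT1], etc.) are illustrative placeholders rather than the paper's exact tokens, and entity 1 is assumed to precede entity 2.

def format_input(tokens, span1, span2, mode):
    (s1, e1), (s2, e2) = span1, span2      # assume entity 1 precedes entity 2
    if mode == "standard":
        return list(tokens)
    if mode == "entity_markers":
        # insert boundary markers around each entity mention
        return (tokens[:s1] + ["[E1]"] + tokens[s1:e1] + ["[/E1]"]
                + tokens[e1:s2] + ["[E2]"] + tokens[s2:e2] + ["[/E2]"]
                + tokens[e2:])
    if mode == "entity_tokens":
        # replace each entity mention with a single special token
        return tokens[:s1] + ["[ENT1]"] + tokens[e1:s2] + ["[ENT2]"] + tokens[e2:]
    raise ValueError(mode)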

Entity positional encoding
To make up for the standard input's lack of entity identification, or to further emphasize the positions of the entities, one can add a special entity positional encoding to accompany the input sequence x. As shown in Figure 3, for entity 1, the entity positional encoding is the distance to entity 1's starting token. Now there are two design choices. The first is whether to use entity positional encoding at all. The second, as shown in Figure 1, is, if entity positional encoding is used, whether to add this extra embedding to the embedding layer of BERT (denoted as add to embedding) or to concatenate it to the output of the BERT encoder (denoted as concat to output).
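A hedged sketch of the two options follows, assuming a learned embedding over clipped token distances; the module and parameter names are ours.

import torch
import torch.nn as nn

class EntityPositionEncoding(nn.Module):
    """Illustrative sketch: embed each token's distance to an entity's starting token."""
    def __init__(self, max_dist, dim, mode):
        super().__init__()
        assert mode in ("add_to_embedding", "concat_to_output")
        self.mode = mode
        self.max_dist = max_dist
        # shift distances so that negative offsets map to valid embedding indices
        self.emb = nn.Embedding(2 * max_dist + 1, dim)

    def forward(self, hidden, entity_start):
        # hidden: (seq_len, dim_h) token embeddings or BERT outputs
        dist = torch.arange(hidden.size(0)) - entity_start
        dist = dist.clamp(-self.max_dist, self.max_dist) + self.max_dist
        pos = self.emb(dist)
        if self.mode == "add_to_embedding":       # added to BERT's embedding layer input
            return hidden + pos                   # requires dim == dim_h
        return torch.cat([hidden, pos], dim=-1)   # concatenated to the encoder output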

Pooling layer
How to aggregate the entities' and contexts' hidden representations into fixed-length feature vectors, i.e., which poolers to use, is a core part of the RC model architecture. In this work, we investigate 5 different poolers: average pooling (avg pool), max pooling (max pool), self-attention pooling (self attn pool), dynamic routing pooling (dr pool) (Gong et al., 2018), and start pooling (start pool), which uses the representation of the starting token as in Baldini Soares et al. (2019).
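A sketch of four of the five poolers follows (dynamic routing is omitted for brevity); the interface is illustrative, and each call aggregates the hidden states of a single entity mention or contextual piece.

import torch
import torch.nn as nn

class SelfAttnPool(nn.Module):
    """Self-attention pooling: a learned scorer weights each token."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                       # h: (span_len, dim)
        w = torch.softmax(self.score(h), dim=0)
        return (w * h).sum(dim=0)

def pool(h, kind, self_attn=None):
    # h holds the hidden states of one entity mention or one contextual piece
    if kind == "avg_pool":
        return h.mean(dim=0)
    if kind == "max_pool":
        return h.max(dim=0).values
    if kind == "start_pool":                    # first token, as in Baldini Soares et al. (2019)
        return h[0]
    if kind == "self_attn_pool":
        return self_attn(h)
    raise ValueError(kind)                      # dr_pool (dynamic routing) omitted here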

Output features
To select appropriate features for classifying relation types, there are many design choices. First, whether the two entity vectors should be used as features. Second, whether each contextual piece (c_0, c_1, c_2) should be added as a feature (Eberts and Ulges, 2019; Wu and He, 2019).
We notice that the literature does not consider the interactions of the features from different parts of the sentence, which proves to be useful in other tasks such as natural language inference (NLI) (Chen et al., 2016). Here, we consider the interaction between the two entities, and their interactions with contextual pieces. The interaction can be dot product (denoted as dot) or absolute difference (denoted as minus) between two feature vectors.
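A minimal sketch of the two interaction operators follows; here "dot" is interpreted as an element-wise product of the two feature vectors, following common practice for feature interactions in NLI, though a scalar dot product is also possible.

import torch

def interact(u, v, kind):
    # feature interaction between two pooled vectors u and v
    if kind == "dot":
        return u * v                 # element-wise product
    if kind == "minus":
        return torch.abs(u - v)      # absolute difference
    return None                      # "null": no interaction feature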

Search space
Now we are ready to define the search space formally. The search space is as follows:
• entity span identification = entity markers, entity tokens, standard;
• how to use entity positional embedding = null, add to embedding, concat to output;
• poolers for entity or contextual piece = avg pool, max pool, self attn pool, dr pool, start pool;
• whether to use the representation of entity e_i = True, False, where i = 1, 2;
• whether to use the representation of context c_i = True, False, where i = 0, 1, 2;
• interaction between the two entities = dot, minus, null, where null means no interaction;
• interaction between entity and contextual piece c_i = dot, minus, null, where null means no interaction, and i = 0, 1, 2.
Our search space contains 1.64e+8 combinations of design choices, which makes manual fine-tuning or random search impractical.
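Assuming a separate pooler is chosen for each of the five parts and a separate interaction decision is made per contextual piece (our factorization, which may not reproduce the exact count above), the search space can be encoded as one categorical choice per decision, for example:

# Illustrative encoding of the search space as one categorical choice per decision.
SEARCH_SPACE = {
    "span_identification": ["entity_markers", "entity_tokens", "standard"],
    "entity_position_embedding": ["null", "add_to_embedding", "concat_to_output"],
    # one pooler decision per entity mention and per contextual piece (assumed factorization)
    **{f"pooler_{part}": ["avg_pool", "max_pool", "self_attn_pool", "dr_pool", "start_pool"]
       for part in ("e1", "e2", "c0", "c1", "c2")},
    "use_e1": [True, False], "use_e2": [True, False],
    "use_c0": [True, False], "use_c1": [True, False], "use_c2": [True, False],
    "entity_entity_interaction": ["dot", "minus", "null"],
    **{f"entity_context_interaction_{c}": ["dot", "minus", "null"] for c in ("c0", "c1", "c2")},
}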

Search method
In this section, we first formally formulate the problem of architecture search with reinforcement learning. Then, we discuss the search algorithm based on policy gradient. Finally, we discuss our modifications to stabilize the search outputs.

Problem formulation
Given a search space M of neural architectures and a dataset split into a train set D_train and a validation set D_valid, we aim to find the best architecture m* ∈ M that maximizes the reward on the validation set:

m* = argmax_{m ∈ M} R(m, w*_m; D_valid),  where  w*_m = argmin_w L(m, w; D_train),    (1)

where w denotes the parameters of the RC model defined by m. Figure 4 shows the reinforcement learning framework used to solve Eq. 1 by continuously sampling architectures m ∈ M and evaluating the reward (performance score) R on the validation set D_valid. First, the controller, a recurrent network, generates a network description m ∈ M that corresponds to an RC model. Then, the generated model m is trained on D_train and tested on the validation set D_valid. The test result is taken as a reward signal R to update the controller.

Search and evaluation
The whole procedure for model search can be divided into a search phase and an evaluation phase. The search phase updates the shared parameters and the parameters of the controller in an interleaving manner, while the evaluation phase obtains multiple top-ranked models from the controller and trains them to convergence on the task dataset for a proper evaluation of the learned architectures.
Parameter sharing. To avoid training each child model from scratch to obtain reward signals, parameter sharing is applied. The parameters of an operator are re-used whenever a child model chooses that operator. Specific to our architecture, the BERT encoder and the final classifier are shared across all child models. We denote the collection of all shared parameters as Φ.
Search phase. We now describe the interleaving optimization procedure. First, an architecture is sampled by the controller, and its network parameters are initialized from Φ. It is trained for n_c steps (n_c is usually a small integer), during which Φ is updated. Then, the reward of this model is obtained on D_valid. With n reward signals received, the controller parameters Θ are updated using policy gradients following REINFORCE (Williams, 1992):

∇_Θ J(Θ) ≈ (1/n) Σ_{k=1}^{n} (R_k − b) ∇_Θ log P(m_k; Θ),

where b denotes a moving average of the past rewards and is used to reduce the variance of the gradient approximation. In this work, we find that n = 1 already works quite well. This interleaving optimization procedure is repeated for N iterations until the controller is well trained; then we generate k candidate architectures, evaluate them using the shared parameters, and select the top-ranked k_e models for architecture evaluation.
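A minimal sketch of one such interleaved step follows, assuming a controller that returns a sampled architecture together with the summed log-probability of its choices; build_child_model, train_child, and evaluate are hypothetical helpers, not the paper's actual code.

def search_step(controller, controller_opt, shared_params, train_batches, valid_batch,
                baseline, n_c=4, beta=0.95):
    # 1) sample an architecture; its modules are initialized from the shared parameters Φ
    arch, log_prob = controller.sample()
    child = build_child_model(arch, shared_params)     # hypothetical helper
    train_child(child, train_batches, steps=n_c)       # updates Φ for n_c steps
    # 2) reward R = validation score of the partially trained child model
    reward = evaluate(child, valid_batch)
    # 3) REINFORCE with a moving-average baseline b (here n = 1 reward per update)
    baseline = beta * baseline + (1 - beta) * reward
    loss = -(reward - baseline) * log_prob
    controller_opt.zero_grad()
    loss.backward()
    controller_opt.step()
    return reward, baseline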
Evaluation phase. In this phase, the top-ranked models are trained on the whole train set and validated on the dev set to select the best checkpoint for prediction on the test set. Note that the shared parameters Φ are discarded in this phase, and the learned architecture is trained from scratch. To fully evaluate each architecture, we run a grid search for the optimal hyper-parameters, including learning rate, batch size and warm-up steps. After the optimal combination of hyper-parameters is selected, the model is run several times to verify reproducibility.

Improving search
Now we propose a few methods to stabilize the search results and improve the search performance.
Heterogeneous parameter sharing. The reward signals directly rely on the parameter sharing mechanism, so we should think carefully about how to design a proper parameter sharing strategy for RC model search. Parameter sharing in ENAS is unconditional. Note that too much or too little parameter sharing can generate unreliable reward signals, guiding the controller in wrong directions. Based on our extensive experiments, we now present our parameter sharing strategy, which we call heterogeneous parameter sharing, since the idea is to share parameters among modules that play similar roles in the model architectures. The details are as follows: (a) the entity span identification method entity tokens significantly alters the original sentence; thus, it is natural for it to use a different BERT encoder in the child models; (b) since entities and contexts play quite different roles in RC tasks, the aggregators for entities and contexts do not share parameters. Note that the start pooler and the dr pooler have a common component, a linear layer followed by a non-linear module, so this linear layer is shared between these two aggregators, separately for entities and for contexts. In all cases, we use the linear layer of the BERT pooler to initialize the linear layers of the start pooler and the dr pooler.
Multiple copies of shared parameters. Note that all child models have a BERT encoder and a classifier layer, so the parameters in these modules may over-fit quickly. Thus, during the search phase, we maintain multiple copies of these modules; each time we initialize a child model, a copy of the BERT encoder and the classifier layer is randomly selected from the shared parameters Φ. After updating, these copies are stored back into Φ.
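A sketch combining both ideas is shown below: shared modules are grouped by role (so that, for example, the entity tokens format draws from a different BERT pool than the other two formats), and each group keeps multiple copies that are drawn at random and written back after training. All names and the exact grouping are illustrative assumptions, not the paper's implementation.

import copy
import random

class SharedParameterPool:
    """Illustrative sketch of heterogeneous sharing with multiple copies.
    Modules are grouped by role; child models only share within a group."""
    def __init__(self, template_modules, n_copies=2):
        # e.g. groups for the two kinds of BERT encoder, the entity aggregators,
        # the context aggregators, and the classifier
        self.pool = {role: [copy.deepcopy(m) for _ in range(n_copies)]
                     for role, m in template_modules.items()}

    def draw(self, role):
        # randomly pick one stored copy for the current child model
        idx = random.randrange(len(self.pool[role]))
        return idx, self.pool[role][idx]

    def store(self, role, idx, module):
        # write the updated parameters back after training the child model
        self.pool[role][idx] = module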
Search warm-ups. At the beginning of training, the shared parameters are not yet trained, so the generated reward signals are unreliable. Thus, during the first few epochs, the controller generates child models to train on the dataset, but the controller itself is not updated.

Experiments
Due to resource limitations, we assign up to 2 NVIDIA V100 GPU cards to each task.

Search protocol
During the search phase, the interleaving optimization process is run for 100 epochs. Throughout this work, we use the base uncased version of BERT (Devlin et al., 2018) as the sentence encoder, and its parameters are fine-tuned to better adjust to downstream tasks. During search, 4 copies of BERT model checkpoints are maintained, 2 for the entity tokens method and 2 for the other two entity span identifiers, so each time we initialize a child model, a BERT checkpoint is randomly selected and its parameters can be updated. If the entity position embedding is concatenated after the BERT output, its size is set to 12. During search, each child model is trained with 4 batches of training data and evaluated on a single batch of validation data, where the evaluation batch size is 4 times the training batch size. The learning rate for the controller is set to 1e-4, and the learning rate and batch size for the sampled architectures are manually tuned to obtain better search results. During search, the number of warm-up steps for the BERT encoders is set to 0.8 of an epoch, and the number of search warm-up steps is set to 1.5 epochs.

Architecture evaluation protocol
In this work, we differentiate between a NAS method's performance and that of a learned model. We obtain the former by running the architecture search 5 times; the best learned model's performance in each run is regarded as the NAS method's performance for that run. The best learned model of each search is also run 10 times.
To make our results more reproducible, each learned model and each baseline model is trained 10 times, and the mean and variance of the performance are reported. For evaluating the search method, after the search phase, 30 model architectures are sampled from the trained controller and ranked by their performance on the validation data when initialized with the shared parameters. The top-ranked 5 models are then trained from scratch to convergence on the whole training data of the task to formally evaluate their performance. The best learned model's performance of a search run is regarded as the search method's performance. In this work, we report the mean and standard deviation of the search method's performance over 5 independent runs.
To compare our method with random search, for each task we randomly sample 10 different models with a randomly initialized controller, since the GPU time for training 10 models is guaranteed to be larger than that of an entire search and evaluation process as described above.
To thoroughly evaluate a learned model or a baseline model, we run a random search with 10 trials over the following space of key hyper-parameters:
• learning rate = 1e-4, 5e-5, 2e-5, 1e-5;
• training batch size = 128, 64, 32;
• warm-up steps = 0.8, 1.0 of the number of steps in an epoch.
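A small sketch of this hyper-parameter random search is shown below; the grid values come from the list above, while the sampling routine itself is illustrative.

import random

HP_GRID = {
    "learning_rate": [1e-4, 5e-5, 2e-5, 1e-5],
    "train_batch_size": [128, 64, 32],
    "warmup_epochs": [0.8, 1.0],   # fraction of the steps in one epoch
}

def sample_hparams(n_trials=10, seed=0):
    # draw n_trials random hyper-parameter combinations from the grid
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in HP_GRID.items()} for _ in range(n_trials)]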
The hyper-params for the baseline models are reported in the Appendix.

Baseline models
In this work, we select two strong baselines for comparison. The first is BERT-entity, the best model from Baldini Soares et al. (2019). The second is R-BERT by Wu and He (2019). BERT-entity and R-BERT are implemented with OpenNRE (Han et al., 2019). The two models are special cases in our search space. The baseline models also go through the reproducibility protocols above. We do not compare with traditional deep-learning based models from the pre-BERT era, since BERT-entity significantly outperforms them.

Results on Benchmark datasets
The results on the 8 benchmark RC datasets are reported in Table 1. We report both the performance of the search methods and the performance of the best model learned on each task using AutoRC. For all eight tasks, AutoRC successfully obtains higher average scores than the baseline models. In addition, we find that AutoRC outperforms naive ENAS and random search, and its results are more stable. We can also see that the best learned model outperforms the baseline models significantly. One observation is that the test results of the searched architectures are consistently more stable than those of the baselines, which also validates that our method is efficient at finding a task-specific model for the task at hand. Figures 5, 6 and 7 report the best searched architectures (AR_deft2020, AR_i2b2 and AR_kbp37) for the deft2020, i2b2 and kbp37 tasks. We can see that the learned architectures can be quite
different, thus validating the necessity of task specificity. The learned models differ in the following three aspects. First, AR_deft2020 chooses to replace entity mentions with entity tokens. We hypothesize that in deft-2020, the entities are often quite long, so replacing entity mentions with entity tokens helps the model understand the contexts' structural patterns. Second, note that AR_deft2020 uses start pool to aggregate context piece c_0, which amounts to using the representation of the [CLS] token. In addition, it includes the representation of context c_1, which is also used in AR_kbp37. Third, AR_deft2020 incorporates the interactions between context c_0 and the two entities, while AR_i2b2 and AR_kbp37 include the interaction between the two entities. The differences in the learned architectures for different tasks indicate the necessity of task-specific architectures, which would be challenging to obtain without the help of NAS. We believe two aspects can affect the learned models. First, different domains have different contexts, which may lead to different models. Second, the formulation of the data: for example, in deft-2020, some extended definitions of scientific concepts are annotated as entities, so the average entity mention length (18.5) is quite different from other tasks (2.3 in ddi).
In Table 1, we also study how an architecture learned on one task performs on another. Note that when evaluated on a different task, an architecture's hyper-parameters are tuned again, following the procedure described in subsection 5.3. The architecture learned on kbp37, an open-domain dataset (AR_kbp37), transfers well to wiki80. However, it does not perform well on the two medical-domain tasks, i2b2 and ddi. In contrast, the architectures learned on i2b2 and ddi transfer well to each other and perform comparably well. These results demonstrate that the learned models have a certain ability to transfer across tasks, but their suitability is significantly affected by the domains of the tasks.

Ablation study on the search space
We further investigate the specific contributions of the different components of the search space. For this purpose, we create three smaller search spaces. The first one, denoted as M_1, does not allow any interactions among entity features and context features. The second one, M_2, further reduces M_1 by restricting the only available pooling operation to start pooling. The third one, M_3, further forbids contextual features. If we further restrict the entity span identification method to entity markers, the search space is reduced to the baseline BERT-entity model. The search and evaluation protocols on the reduced search spaces strictly follow the previous subsections.
The ablation study for the search space is done on deft2020 and i2b2, with results reported in Table 2. For deft2020, altering the method for span identification provides a significant performance gain, and interaction among features is also important. For i2b2, the most significant performance drop occurs when the pooling operations are limited, indicating that even for a powerful bi-directional context encoder like BERT, considering different pooling operations is beneficial.

Ablations on the modifications for search method
In this subsection, we show that our modifications to the search method, i.e., to naive ENAS, are indeed effective and necessary. Here we use AutoRC to denote our method, which is the combination of ENAS and our proposed modifications. We experiment with three variants of AutoRC. The first, AutoRC_1, drops heterogeneous parameter sharing, that is, all input formats share the same BERT encoder, and all context and entity representations share the same aggregators. The second variant, AutoRC_2, maintains a single copy of the shared weights. The third variant, AutoRC_3, drops search warm-ups. The average search performance, which is the average score of the best learned model at each search run, and its standard deviation are reported in Table 3. From the results, dropping any of the three strategies we propose results in a performance drop and increased variance, and changing the parameter sharing strategy causes the most significant performance drops on both tasks. These results demonstrate that our proposed modifications make the reward signal during search more reliable, thus resulting in better searched architectures.

Applications to other entity related tasks
In Table 4, we apply our AutoRC framework to two other entity related tasks, i.e., coreference resolution and span based NER. AutoRC can be directly applied to coreference resolution since that task essentially asks the model to determine whether an expression refers to an entity. It can also be applied to span based NER since that task asks the model to determine whether a span in the sentence is an entity.
We experiment on the OntoNotes coreference resolution benchmark (Pradhan et al., 2012). The metric is MUC F1, and we choose the state-of-the-art (SOTA) SpanBERT as the baseline. The results show that our AutoRC framework can effectively improve the performance of the SpanBERT checkpoint.
We experiment on the NER task of CoNLL04 (Roth and tau Yih, 2004), which uses entity level F1 as metric. Eberts and Ulges (2020) provides a SOTA baseline. The results show that performance improves via AutoRC.

Conclusion
In this work, we first construct a comprehensive search space that includes many important design choices for a BERT based RC model. Then we design an efficient search method with the help of RL to navigate this search space. To improve the search results, parameter sharing strategies different from ENAS are designed. To avoid over-fitting, we maintain multiple copies of shared weights during search. To stabilize the reward signal, search warm-ups are applied. Experiments on eight benchmark RC tasks show that our method can outperform the standard BERT based RC model significantly. Ablation studies show that our search space design and proposed modifications are effective.

Table 6: Hyper-parameters used for fine-tuning: learning rate (lr), batch size (bsz) and warm-up steps (warm-up). Warm-up is reported as the proportion of steps in one epoch. One common hyper-parameter is the max sequence length, which is set to 256.

task      model        lr    bsz  warm-up
…         …            5e-5  64   1.0
wiki80    R-BERT       5e-5  128  0.8
wiki80    BERT-entity  2e-5  64   1.0
wiki80    AR_wiki80    2e-5  64   1.0
deft2020  R-BERT       1e-4  64   0.8
deft2020  BERT-entity  5e-5  64   1.0
deft2020  AR_deft2020  1e-4  64   0.8
i2b2      R-BERT       2e-5  32   0.8
i2b2      BERT-entity  5e-5  32   0.8
i2b2      AR_i2b2      1e-5  32   0.8
ddi       R-BERT       5e-5  64   0.8
ddi       BERT-entity  2e-5  32   0.8
ddi       AR_ddi       5e-5  64   1.0
chemprot  R-BERT       5e-5  64   0.8
chemprot  BERT-entity  1e-5  128  0.8
chemprot  AR_chemprot  5e-5  64   1.0