Automated Concatenation of Embeddings for Structured Prediction

Pretrained contextualized embeddings are powerful word representations for structured prediction tasks. Recent work found that better word representations can be obtained by concatenating different types of embeddings. However, the selection of embeddings to form the best concatenated representation usually varies depending on the task and the collection of candidate embeddings, and the ever-increasing number of embedding types makes it a more difficult problem. In this paper, we propose Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks, based on a formulation inspired by recent progress on neural architecture search. Specifically, a controller alternately samples a concatenation of embeddings, according to its current belief of the effectiveness of individual embedding types in consideration for a task, and updates the belief based on a reward. We follow strategies in reinforcement learning to optimize the parameters of the controller and compute the reward based on the accuracy of a task model, which is fed with the sampled concatenation as input and trained on a task dataset. Empirical results on 6 tasks and 21 datasets show that our approach outperforms strong baselines and achieves state-of-the-art performance with fine-tuned embeddings in all the evaluations.


Introduction
Recent developments on pretrained contextualized embeddings have significantly improved the performance of structured prediction tasks in natural language processing. Approaches based on contextualized embeddings, such as ELMo (Peters et al., 2018), Flair (Akbik et al., 2018), BERT (Devlin et al., 2019), and XLM-R (Conneau et al., 2020), have been consistently raising the state-of-the-art for various structured prediction tasks. Concurrently, research has also shown that word representations based on the concatenation of multiple pretrained contextualized embeddings and traditional non-contextualized embeddings (such as word2vec (Mikolov et al., 2013) and character embeddings (Santos and Zadrozny, 2014)) can further improve performance (Peters et al., 2018; Akbik et al., 2018; Straková et al., 2019; Wang et al., 2020b). Given the ever-increasing number of embedding learning methods that operate on different granularities (e.g., word, subword, or character level) and with different model architectures, choosing the best embeddings to concatenate for a specific task becomes non-trivial, and exploring all possible concatenations can be prohibitively demanding in computing resources.
* Yong Jiang and Kewei Tu are the corresponding authors.
‡ This work was conducted when Xinyu Wang was interning at Alibaba DAMO Academy.
1 Our code is publicly available at https://github.com/Alibaba-NLP/ACE.
Neural architecture search (NAS) is an active area of research in deep learning to automatically search for better model architectures, and has achieved state-of-the-art performance on various tasks in computer vision, such as image classification (Real et al., 2019), semantic segmentation (Liu et al., 2019a), and object detection (Ghiasi et al., 2019). In natural language processing, NAS has been successfully applied to find better RNN structures (Zoph and Le, 2017; Pham et al., 2018b) and recently better transformer structures (So et al., 2019; Zhu et al., 2020).
In this paper, we propose Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks. ACE is formulated as an NAS problem. In this approach, an iterative search process is guided by a controller based on its belief that models the effectiveness of individual embedding candidates in consideration for a specific task. At each step, the controller samples a concatenation of embeddings according to the belief model and then feeds the concatenated word representations as inputs to a task model, which in turn is trained on the task dataset and returns the model accuracy as a reward signal to update the belief model. We use the policy gradient algorithm (Williams, 1992) in reinforcement learning (Sutton and Barto, 1992) to solve the optimization problem. In order to improve the efficiency of the search process, we also design a special reward function by accumulating all the rewards based on the transformation between the current concatenation and all previously sampled concatenations.
Our approach is different from previous work on NAS in the following aspects:
1. Unlike most previous work, we focus on searching for better word representations rather than better model architectures.
2. We design a novel search space for the embedding concatenation search. Instead of using an RNN as in the previous work of Zoph and Le (2017), we design a more straightforward controller to generate the embedding concatenation. We also design a novel reward function in the optimization objective to better evaluate the effectiveness of each concatenation of embeddings.
3. ACE achieves high accuracy without the need for retraining the task model, which is typically required in other NAS approaches.
4. Our approach is efficient and practical. Although ACE is formulated in an NAS framework, ACE can find a strong word representation on a single GPU with only a few GPU-hours for structured prediction tasks. In comparison, many NAS approaches require dozens or even thousands of GPU-hours to search for good neural architectures for their corresponding tasks.
Empirical results show that ACE outperforms strong baselines. Furthermore, when ACE is applied to concatenate pretrained contextualized embeddings fine-tuned on specific tasks, we can achieve state-of-the-art accuracy on 6 structured prediction tasks including Named Entity Recognition (Sundheim, 1995), Part-Of-Speech tagging (DeRose, 1988), chunking (Tjong Kim Sang and Buchholz, 2000), aspect extraction (Hu and Liu, 2004), syntactic dependency parsing (Tesnière, 1959) and semantic dependency parsing (Oepen et al., 2014) over 21 datasets. Besides, we also analyze the advantage of ACE and reward function design over the baselines and show the advantage of ACE over ensemble models.
Related Work

Embeddings
Non-contextualized embeddings, such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017), benefit many NLP tasks. Character embeddings (Santos and Zadrozny, 2014) are trained together with the task and applied in many structured prediction tasks (Ma and Hovy, 2016; Lample et al., 2016; Dozat and Manning, 2018). For pretrained contextualized embeddings, ELMo (Peters et al., 2018), a pretrained contextualized word embedding generated with multiple Bidirectional LSTM layers, significantly outperforms previous state-of-the-art approaches on several NLP tasks. Following this idea, Akbik et al. (2018) proposed Flair embeddings, a kind of contextualized character embedding that achieved strong performance in sequence labeling tasks. Recently, Devlin et al. (2019) proposed BERT, which encodes contextualized sub-word information with Transformers (Vaswani et al., 2017) and significantly improves the performance on many NLP tasks. Much research, such as RoBERTa (Liu et al., 2019c), has focused on improving the BERT model's performance through stronger masking strategies. Moreover, multilingual contextualized embeddings have become popular. Pires et al. (2019) and Wu and Dredze (2019) showed that Multilingual BERT (M-BERT) could learn a good multilingual representation effectively with strong cross-lingual zero-shot transfer performance in various tasks. Conneau et al. (2020) proposed XLM-R, which is trained on a larger multilingual corpus and significantly outperforms M-BERT on various multilingual tasks.

Neural Architecture Search
Recent progress on deep learning has shown that network architecture design is crucial to the model performance. However, designing a strong neural architecture for each task requires enormous effort, a high level of knowledge, and experience in the task domain. Therefore, automatic design of neural architectures is desired. A crucial part of NAS is search space design, which defines the discoverable NAS space. Previous work (Baker et al., 2017; Zoph and Le, 2017; Xie and Yuille, 2017) designs a global search space (Elsken et al., 2019) which incorporates structures from hand-crafted architectures. For example, Zoph and Le (2017) designed a chained-structured search space with skip connections. The global search space usually has a considerable degree of freedom. For example, the approach of Zoph and Le (2017) takes 22,400 GPU-hours to search on the CIFAR-10 dataset. Based on the observation that existing hand-crafted architectures contain repeated structures (Szegedy et al., 2016; He et al., 2016; Huang et al., 2017), Zoph et al. (2018) explored a cell-based search space which can reduce the search time to 2,000 GPU-hours.
In recent NAS research, reinforcement learning and evolutionary algorithms are the most common approaches. In reinforcement learning, the agent's actions are the generation of neural architectures and the action space is identical to the search space. Previous work usually applies an RNN layer (Zoph and Le, 2017; Zhong et al., 2018; Zoph et al., 2018) or uses a Markov Decision Process (Baker et al., 2017) to decide the hyper-parameters of each structure and the input order of each structure. Evolutionary algorithms have been applied to architecture search for many decades (Miller et al., 1989; Angeline et al., 1994; Stanley and Miikkulainen, 2002; Floreano et al., 2008; Jozefowicz et al., 2015). The algorithm repeatedly generates new populations through recombination and mutation operations and selects survivors through competition among the population. Recent work with evolutionary algorithms differs in the method of parent/survivor selection and population generation. For example, Real et al. (2017), Liu et al. (2018a), Wistuba (2018) and Real et al. (2019) applied tournament selection (Goldberg and Deb, 1991) for the parent selection while Xie and Yuille (2017)

Automated Concatenation of Embeddings
In ACE, a task model and a controller interact with each other repeatedly. The task model predicts the task output, while the controller searches for a better embedding concatenation as the word representation for the task model to achieve higher accuracy.
Given an embedding concatenation generated by the controller, the task model is trained over the task data and returns a reward to the controller. The controller receives the reward to update its parameters and samples a new embedding concatenation for the task model. Figure 1 shows the general architecture of our approach.

Task Model
For the task model, we focus on sequence-structured and graph-structured outputs. Given a structured prediction task with input sentence $x$ and structured output $y$, we can calculate the probability distribution $P(y|x)$ by:

$$P(y|x) = \frac{\exp(\mathrm{Score}(x, y))}{\sum_{y' \in \mathbb{Y}(x)} \exp(\mathrm{Score}(x, y'))}$$

where $\mathrm{Score}(x, y)$ is the score the model assigns to output $y$ and $\mathbb{Y}(x)$ represents all possible output structures given the input sentence $x$. Depending on different structured prediction tasks, the output structure $y$ can be label sequences, trees, graphs or other structures. In this paper, we use sequence-structured and graph-structured outputs as two exemplar structured prediction tasks. We use BiLSTM-CRF model [...] where $V = [v_1; \cdots; v_n]$, $V \in \mathbb{R}^{d \times n}$, is a matrix of the word representations for the input sentence $x$ with $n$ words, and $d$ is the hidden size of the concatenation of all embeddings. The word representation $v_i$ of the $i$-th word is a concatenation of $L$ types of word embeddings:

$$v_i^l = \mathrm{embed}_i^l(x); \quad v_i = [v_i^1; v_i^2; \ldots; v_i^L]$$

where $\mathrm{embed}^l$ is the model of the $l$-th embeddings, $v_i \in \mathbb{R}^{d}$, $v_i^l \in \mathbb{R}^{d_l}$, and $d_l$ is the hidden size of $\mathrm{embed}^l$.

Search Space Design
The neural architecture search space can be represented as a set of neural networks (Elsken et al., 2019). A neural network can be represented as a directed acyclic graph with a set of nodes and directed edges. Each node represents an operation, while each edge represents the inputs and outputs between these nodes. In ACE, we represent each embedding candidate as a node. The input to the nodes is the input sentence $x$, and the outputs are the embeddings $v^l$. Since we concatenate the embeddings as the word representation of the task model, there is no connection between nodes in our search space. Therefore, the search space can be significantly reduced. For each node, there are many options to extract word features. Taking BERT embeddings as an example, Devlin et al. [...]

In NAS, weight sharing (Pham et al., 2018a) shares the weights of structures when training different neural architectures to reduce the training cost. In comparison, we fix the weights of the pretrained embedding candidates in ACE except for the character embeddings. Instead of sharing the parameters of the embeddings, we share the parameters of the task models at each step of the search. However, the hidden size of the word representation varies over the concatenations, making the weight sharing of structured prediction models difficult. Instead of deciding whether each node exists in the graph, we keep all nodes in the search space and add an additional operation for each node to indicate whether the embedding is masked out. To represent the selected concatenation, we use a binary vector $a = [a_1, \cdots, a_l, \cdots, a_L]$ as a mask to mask out the embeddings which are not selected:

$$v_i = [v_i^1 a_1; \ldots; v_i^l a_l; \ldots; v_i^L a_L]$$

where $a_l$ is a binary variable. Since the input $V$ is applied to a linear layer in the BiLSTM layer, multiplying the mask with the embeddings is equivalent to directly concatenating the selected embeddings:

$$W^\top v_i = \sum_{l=1}^{L} W_l^\top v_i^l a_l \tag{2}$$

where $W = [W_1; W_2; \ldots; W_L]$, $W \in \mathbb{R}^{d \times h}$ and $W_l \in \mathbb{R}^{d_l \times h}$.
Therefore, the model weights can be shared after applying the embedding mask to all embedding candidates' concatenation. Another benefit of our search space design is that we can remove the unused embedding candidates and the corresponding weights in W for a lighter task model after the best concatenation is found by ACE.
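The equivalence between masking and direct concatenation can be checked numerically. The following is a minimal sketch, assuming three hypothetical embedding candidates with made-up hidden sizes and random weights; none of these values come from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: L = 3 embedding candidates with hidden sizes d_l, output size h.
dims = [4, 3, 5]
h = 6

# One word's embeddings v_i^l and the shared linear weight W = [W_1; W_2; W_3].
v_parts = [rng.normal(size=d_l) for d_l in dims]
W_parts = [rng.normal(size=(d_l, h)) for d_l in dims]
W = np.vstack(W_parts)  # shape (d, h) with d = sum(dims)

# Binary mask a: keep candidates 1 and 3, mask out candidate 2.
a = np.array([1, 0, 1])

# Masked concatenation [v^1 a_1; v^2 a_2; v^3 a_3], followed by W^T v_i.
v_masked = np.concatenate([v * a_l for v, a_l in zip(v_parts, a)])
out_masked = W.T @ v_masked

# Direct concatenation of only the selected embeddings with the matching rows of W.
v_selected = np.concatenate([v for v, a_l in zip(v_parts, a) if a_l == 1])
W_selected = np.vstack([W_l for W_l, a_l in zip(W_parts, a) if a_l == 1])
out_selected = W_selected.T @ v_selected

print(np.allclose(out_masked, out_selected))
```

Because the masked block contributes only zeros to the product, its rows of `W` can be dropped after the search without changing the output, which is what allows the lighter task model mentioned above.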

Searching in the Space
During search, the controller generates the embedding mask for the task model iteratively. We use parameters $\theta = [\theta_1; \theta_2; \ldots; \theta_L]$ for the controller instead of using the RNN structure applied in previous approaches (Zoph and Le, 2017; Zoph et al., 2018). The probability distribution of selecting a concatenation $a$ is $P^{ctrl}(a; \theta) = \prod_{l=1}^{L} P_l^{ctrl}(a_l; \theta_l)$. Each element $a_l$ of $a$ is sampled independently from a Bernoulli distribution, which is defined as:

$$P_l^{ctrl}(a_l; \theta_l) = \begin{cases} \sigma(\theta_l) & a_l = 1 \\ 1 - \sigma(\theta_l) & a_l = 0 \end{cases} \tag{3}$$

where $\sigma$ is the sigmoid function. Given the mask, the task model is trained until convergence and returns an accuracy $R$ on the development set. As the accuracy cannot be back-propagated to the controller, we use a reinforcement learning algorithm for optimization. The accuracy $R$ is used as the reward signal to train the controller. The controller's target is to maximize the expected reward $J(\theta) = \mathbb{E}_{P^{ctrl}(a;\theta)}[R]$ through the policy gradient method (Williams, 1992). In our approach, since calculating the exact expectation is intractable, the gradient of $J(\theta)$ is approximated by sampling only one selection following the distribution $P^{ctrl}(a; \theta)$ at each step for training efficiency:

$$\nabla_\theta J(\theta) \approx \sum_{l=1}^{L} \nabla_\theta \log P_l^{ctrl}(a_l; \theta_l)(R - b) \tag{4}$$

where $b$ is the baseline function to reduce the high variance of the update function. The baseline usually can be the highest accuracy during the search process. Instead of merely using the highest accuracy on the development set over the search process as the baseline, we design a reward function on how each embedding candidate contributes to the accuracy change by utilizing all searched concatenations' development scores. We use a binary vector $|a_t - a_i|$ to represent the change between the current embedding concatenation $a_t$ at the current time step $t$ and $a_i$ at a previous time step $i$. We then define the reward function as:

$$r_t = \sum_{i=1}^{t-1} (R_t - R_i)|a_t - a_i| \tag{5}$$

where $r_t$ is a vector with length $L$ representing the reward of each embedding candidate. $R_t$ and $R_i$ are the rewards at time steps $t$ and $i$.
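The controller described above can be sketched in a few lines. This is an illustrative sketch rather than the released implementation: it assumes each candidate is selected with probability sigmoid(theta_l), and the function names, the accuracies 0.92 and 0.90, and the step size are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_mask(theta, rng):
    """Sample each a_l independently from Bernoulli(sigmoid(theta_l))."""
    return (rng.random(theta.shape) < sigmoid(theta)).astype(int)

def log_prob(a, theta):
    """log P_ctrl(a; theta) as a sum of independent Bernoulli log-probabilities."""
    p = sigmoid(theta)
    return float(np.sum(a * np.log(p) + (1 - a) * np.log(1 - p)))

def grad_log_prob(a, theta):
    """Gradient of log P_l(a_l; theta_l) w.r.t. theta_l, which is a_l - sigmoid(theta_l)."""
    return a - sigmoid(theta)

def policy_gradient(a, theta, reward, baseline):
    """One-sample estimate of the gradient of J(theta) with a scalar baseline b."""
    return grad_log_prob(a, theta) * (reward - baseline)

rng = np.random.default_rng(0)
theta = np.zeros(5)          # zero init: every candidate is selected with probability 0.5
a = sample_mask(theta, rng)
g = policy_gradient(a, theta, reward=0.92, baseline=0.90)  # illustrative accuracies
theta = theta + 0.1 * g      # gradient ascent step on the expected reward
```

With a positive `reward - baseline`, the update raises the parameters of the candidates that were selected and lowers those of the candidates that were masked out.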
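When the Hamming distance of two concatenations $\mathrm{Hamm}(a_t, a_i)$ gets larger, the changed candidates' contribution to the accuracy becomes less noticeable. The controller may be misled to reward a candidate that is not actually helpful. We apply a discount factor to reduce the reward for two concatenations with a large Hamming distance to alleviate this issue. Our final reward function is:

$$r_t = \sum_{i=1}^{t-1} (R_t - R_i)\,\gamma^{\mathrm{Hamm}(a_t, a_i) - 1}\,|a_t - a_i| \tag{6}$$

where $\gamma \in (0, 1)$. Eq. 4 is then reformulated as:

$$\nabla_\theta J_t(\theta) \approx \sum_{l=1}^{L} \nabla_\theta \log P_l^{ctrl}(a_t^l; \theta_l)\, r_t^l \tag{7}$$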
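The discounted reward can be computed directly from the history of sampled concatenations. The sketch below is illustrative: the function name, the toy history, and the accuracies are hypothetical, and the exponent Hamm(a_t, a_i) - 1 (no discount when exactly one candidate changes) is an assumption of this sketch.

```python
import numpy as np

def discounted_reward(a_t, R_t, history, gamma=0.5):
    """Accumulate a per-candidate reward vector r_t from all earlier samples.

    history is a list of (a_i, R_i) pairs. Assumption: the discount exponent is
    Hamm(a_t, a_i) - 1, so pairs that differ in one candidate are not discounted.
    """
    a_t = np.asarray(a_t)
    r_t = np.zeros(len(a_t))
    for a_i, R_i in history:
        diff = np.abs(a_t - np.asarray(a_i))  # binary vector |a_t - a_i|
        hamming = int(diff.sum())
        if hamming == 0:
            continue  # identical concatenation: no candidate changed
        r_t += (R_t - R_i) * gamma ** (hamming - 1) * diff
    return r_t

# Toy history of two earlier samples with made-up development accuracies.
history = [([1, 1, 1], 0.90), ([1, 0, 1], 0.91)]
r = discounted_reward([1, 0, 0], 0.89, history, gamma=0.5)
print(r)
```

In this toy case the third candidate receives most of the negative reward, because removing it alone (relative to the second sample) already lowered the accuracy, while the two-candidate change relative to the first sample is halved by the discount.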

Training
To train the controller, we use a dictionary $D$ to store the concatenations and the corresponding validation scores. At $t = 1$, we train the task model with all embedding candidates concatenated. From $t = 2$, we repeat the following steps until a maximum iteration $T$:
1. Sample a concatenation $a_t$ based on the probability distribution in Eq. 3.
2. Train the task model with $a_t$ following Eq. 1 and evaluate the model on the development set to get the accuracy $R_t$.
3. Given the concatenation $a_t$, accuracy $R_t$ and $D$, compute the gradient of the controller following Eq. 7 and update the parameters of the controller.
4. Add $a_t$ and $R_t$ into $D$, set $t = t + 1$.
When sampling $a_t$, we avoid selecting the previous concatenation $a_{t-1}$ and the all-zero vector (i.e., selecting no embedding). If $a_t$ is in the dictionary $D$, we compare $R_t$ with the value in the dictionary and keep the higher one.
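The four steps above can be put together as a small search loop. The following is a minimal sketch, not the released implementation: `train_and_evaluate` is a hypothetical stand-in that returns a synthetic development accuracy instead of training a real task model, and the discount exponent Hamm - 1 and all numeric values are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_and_evaluate(a):
    """Hypothetical stand-in for training the task model with mask a.

    Returns a synthetic accuracy: candidates 0 and 2 help, candidate 3 hurts.
    """
    gains = np.array([0.03, 0.0, 0.02, -0.02])
    return 0.85 + float(gains @ a)

def search(num_candidates=4, max_steps=30, lr=0.1, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_candidates)
    D = {}                                       # mask tuple -> best development accuracy
    a_prev = np.ones(num_candidates, dtype=int)  # t = 1: all candidates concatenated
    D[tuple(a_prev)] = train_and_evaluate(a_prev)
    for t in range(2, max_steps + 1):
        # Step 1: sample a_t, rejecting the previous mask and the all-zero mask.
        while True:
            a_t = (rng.random(num_candidates) < sigmoid(theta)).astype(int)
            if a_t.any() and not np.array_equal(a_t, a_prev):
                break
        # Step 2: train the task model and read the development accuracy.
        R_t = train_and_evaluate(a_t)
        # Step 3: discounted reward against every stored concatenation, then update.
        r_t = np.zeros(num_candidates)
        for key, R_i in D.items():
            diff = np.abs(a_t - np.array(key))
            hamming = int(diff.sum())
            if hamming > 0:
                r_t += (R_t - R_i) * gamma ** (hamming - 1) * diff
        theta += lr * (a_t - sigmoid(theta)) * r_t
        # Step 4: store the result, keeping the higher score for repeated masks.
        key = tuple(a_t)
        D[key] = max(D.get(key, -np.inf), R_t)
        a_prev = a_t
    best = max(D, key=D.get)
    return best, D[best], theta

best_mask, best_score, theta = search()
print(best_mask, round(best_score, 4))
```

Storing the masks as tuples makes them usable as dictionary keys, which keeps the "compare with the stored value and keep the higher one" rule to a single `max` call.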

Experiments
We use ISO 639-1 language codes to represent languages in the table.

Datasets and Configurations
To show ACE's effectiveness, we conduct extensive experiments on a variety of structured prediction tasks varying from syntactic tasks to semantic tasks. The tasks are Named Entity Recognition (NER), Part-Of-Speech (POS) tagging, Chunking, Aspect Extraction (AE), Syntactic Dependency Parsing (DP) and Semantic Dependency Parsing (SDP). The details of the 6 structured prediction tasks in our experiments are shown below:
• NER: We use the corpora of 4 languages from the CoNLL 2002 and 2003 shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) with the standard split.
• POS Tagging
• Chunking: We use CoNLL 2000 (Tjong Kim Sang and Buchholz, 2000) for chunking. Since there is no standard development set for the CoNLL 2000 dataset, we split off 10% of the training data as the development set.
• Aspect Extraction: Aspect extraction is a subtask of aspect-based sentiment analysis (Pontiki et al., 2014, 2015, 2016).
• Semantic Dependency Parsing: We use the DM, PAS and PSD datasets for semantic dependency parsing (Oepen et al., 2014) from the SemEval 2015 shared task (Oepen et al., 2015). The three datasets have the same sentences but with different formalisms. We use the standard split for SDP. In the split, there are in-domain test sets and out-of-domain test sets for each dataset.
Among these tasks, NER, POS tagging, chunking and aspect extraction have sequence-structured outputs while dependency parsing and semantic dependency parsing have graph-structured outputs. POS tagging, chunking and DP are syntactic structured prediction tasks while NER, AE and SDP are semantic structured prediction tasks. We train the controller for 30 steps and save the task model with the highest accuracy on the development set as the final model for testing. Please refer to Appendix A for more details of other settings.

Embeddings
Basic Settings: For the candidates of embeddings on English datasets, we use the language-specific model for ELMo, Flair, base BERT, GloVe word embeddings, fastText word embeddings, non-contextual character embeddings (Lample et al., 2016), multilingual Flair (M-Flair), M-BERT and XLM-R embeddings. The size of the search space in our experiments is $2^{11} - 1 = 2047$. For language-specific models of other languages, please refer to Appendix A for more details. In AE, there are no available Russian-specific BERT, Flair and ELMo embeddings and there are no available Turkish-specific Flair and ELMo embeddings. We use the corresponding English embeddings instead so that the search spaces of these datasets are almost identical to those of the other datasets. All embeddings are fixed during training except that the character embeddings are trained over the task. The empirical results are reported in Section 4.3.1.
Embedding Fine-tuning: A usual approach to get better accuracy is fine-tuning transformer-based embeddings. In sequence labeling, most of the work follows the fine-tuning pipeline of BERT that connects the BERT model with a linear layer for word-level classification. However, when multiple embeddings are concatenated, fine-tuning a specific group of embeddings becomes difficult because of complicated hyper-parameter settings and massive GPU memory consumption. To alleviate this problem, we first fine-tune the transformer-based embeddings over the task and then concatenate these embeddings together with other embeddings in the basic setting to apply ACE. The empirical results are reported in Section 4.3.2.

Results
We use the following abbreviations in our experiments: UAS: Unlabeled Attachment Score; LAS: Labeled Attachment Score; ID: In-domain test set; OOD: Out-of-domain test set. We use language codes for languages in NER and AE.

Comparison With Baselines
To show the effectiveness of our approach, we compare our approach with two strong baselines. For the first one, we let the task model learn by itself the contribution of each embedding candidate that is helpful to the task. We set a to all-ones (i.e., the concatenation of all the embeddings) and train the task model (All). The linear layer weight W in Eq. 2 reflects the contribution of each candidate. For the second one, we use the random search (Random), a strong baseline in NAS (Li and Talwalkar, 2020). For Random, we run the same maximum iteration as in ACE. For the experiments, we report the averaged accuracy of 3 runs. Table 1 shows that ACE outperforms both baselines in 6 tasks over 23 test sets with only two exceptions. Comparing Random with All, Random outperforms All by 0.4 on average and surpasses the accuracy of All on 14 out of 23 test sets, which shows that concatenating all embeddings may not be the best solution to most structured prediction tasks. In general, searching for the concatenation for the word representation is essential in most cases, and our search design can usually lead to better results compared to both of the baselines.

Comparison With State-of-the-Art approaches
As we have shown, ACE has an advantage in searching for better embedding concatenations. We further show that ACE is competitive with or even stronger than state-of-the-art approaches. We additionally use XLNet (Yang et al., 2019) and RoBERTa as the candidates of ACE. In some tasks, we have several additional settings to better compare with previous work. In NER, we also conduct a comparison on the revised version of German datasets in the CoNLL 2006 shared task (Buchholz and Marsi, 2006). Recent work such as Yu et al. (2020) and Yamada et al. (2020) utilizes document contexts in the datasets. We follow their work and extract document embeddings for the transformer-based embeddings. Specifically, we follow the fine-tuning process of Yamada et al. (2020) to fine-tune the transformer-based embeddings over the document except for BERT and M-BERT embeddings. For BERT and M-BERT, we follow the document extraction process of Yu et al. (2020) because we find that the model with such document embeddings is significantly stronger than the model trained with the fine-tuning process of Yamada et al. (2020). In SDP, the state-of-the-art approaches used POS tags and lemmas as additional word features to the network. We add these two features to the embedding candidates and train the embeddings together with the task. We use the fine-tuned transformer-based embeddings on each task instead of the pretrained version of these embeddings as the candidates. We additionally compare with the fine-tuned XLM-R model for NER, POS tagging, chunking and AE, and compare with the fine-tuned XLNet model for DP and SDP, which are strong fine-tuned models in most of the experiments. Results are shown in Tables 2, 3 and 4. Results show that ACE with fine-tuned embeddings achieves state-of-the-art performance in all test sets, which shows that finding a good embedding concatenation helps structured prediction tasks. We also find that ACE is stronger than the fine-tuned models, which shows the effectiveness of concatenating the fine-tuned embeddings.

Efficiency of Search Methods
To show how efficient our approach is compared with the random search algorithm, we compare the algorithm in two aspects on CoNLL English NER dataset. The first aspect is the best development accuracy during training. The left part of Figure 2 shows that ACE is consistently stronger than the random search algorithm in this task. The second aspect is the searched concatenation at each time step. The right part of Figure 2 shows that the accuracy of ACE gradually increases and gets stable when more concatenations are sampled.

Ablation Study on Reward Function Design
To show the effectiveness of the designed reward function, we compare our reward function (Eq. 6) with the reward function without discount factor (Eq. 5) and the traditional reward function (reward term in Eq. 4). We sample 2000 training sentences on the CoNLL English NER dataset for faster training and train the controller for 50 steps. Table 5 shows that both the discount factor and the binary vector $|a_t - a_i|$ for the task are helpful in both development and test datasets.

Figure 2: The left y-axis is the averaged best validation accuracy on the CoNLL English NER dataset. The right y-axis is the averaged validation accuracy of the current selection.

Comparison with Embedding Weighting & Ensemble Approaches
We compare ACE with two more approaches to further show the effectiveness of ACE. One is a variant of All, which uses a weighting parameter $b = [b_1, \cdots, b_l, \cdots, b_L]$ passing through a sigmoid function to weight each embedding candidate. Such an approach can explicitly learn the weight of each embedding in training instead of a binary mask. We call this approach All+Weight. The other is model ensemble, which trains the task model with each embedding candidate individually and uses the trained models to make a joint prediction on the test set. We use voting for the ensemble as it is simple and fast. For sequence labeling tasks, the models vote for the predicted label at each position. For DP, the models vote for the tree of each sentence. For SDP, the models vote for each potential labeled arc. We use the confidence of model predictions to break ties if more than one candidate receives the same number of votes. We call this approach Ensemble. One of the benefits of voting is that it combines the predictions of the task models efficiently without any training process. We can search all possible $2^L - 1$ model ensembles in a short period of time through caching the outputs of the models. Therefore, we search for the best ensemble of models on the development set and then evaluate the best ensemble on the test set ($\text{Ensemble}_{\text{dev}}$). Moreover, we additionally search for the best ensemble on the test set for reference ($\text{Ensemble}_{\text{test}}$), which is the upper bound of the approach. We use the same setting as in Section 4.

Table 4: Comparison with state-of-the-art approaches in DP and SDP. †: For reference, they additionally used constituency dependencies in training. We also find that the PTB dataset used by Mrini et al. (2020) is not identical to the dataset in previous work such as Zhang et al. (2020) and Wang and Tu (2020). ‡: For reference, we confirmed with the authors of He and Choi (2020) that they used a different data pre-processing script from previous work.

ACE outperforms all the settings of these approaches and even $\text{Ensemble}_{\text{test}}$, which shows the effectiveness of ACE and the limitation of ensemble models. All, All+Weight and $\text{Ensemble}_{\text{dev}}$ are competitive in most of the cases and there is no clear winner among these approaches on all the datasets. These results show the strength of embedding concatenation. Concatenating the embeddings incorporates information from all the embeddings and forms stronger word representations for the task model, while in model ensemble, it is difficult for the individual task models to affect each other.
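The voting scheme for sequence labeling can be sketched as follows. This is an illustrative sketch under stated assumptions: each model is assumed to emit a (label, confidence) pair per position, and ties in vote count are broken by the summed confidence, which is one possible reading of the tie-breaking rule. The function name and the toy predictions are hypothetical.

```python
from collections import defaultdict

def vote(predictions):
    """Majority vote over per-position labels from several models.

    predictions: one list per model of (label, confidence) pairs, all of equal length.
    Assumption: ties in vote count are broken by the summed confidence per label.
    """
    length = len(predictions[0])
    result = []
    for pos in range(length):
        counts = defaultdict(int)
        confidence = defaultdict(float)
        for model_output in predictions:
            label, conf = model_output[pos]
            counts[label] += 1
            confidence[label] += conf
        # Compare by vote count first, then by accumulated confidence.
        result.append(max(counts, key=lambda lab: (counts[lab], confidence[lab])))
    return result

# Toy predictions from three hypothetical single-embedding task models.
model_a = [("B-PER", 0.9), ("O", 0.6)]
model_b = [("B-PER", 0.8), ("B-LOC", 0.7)]
model_c = [("O", 0.5), ("B-ORG", 0.4)]
print(vote([model_a, model_b, model_c]))
```

Since voting only needs the cached per-model outputs, evaluating every subset of models amounts to calling such a function once per subset, with no retraining.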

Discussion: Practical Usability of ACE
Concatenating multiple embeddings is a commonly used approach to improve accuracy of structured prediction. However, such approaches can be computationally costly as multiple language models are used as input. ACE is more practical than concatenating all embeddings as it can remove those embeddings that are not very useful in the concatenation. Moreover, ACE models can be used to guide the training of weaker models through techniques such as knowledge distillation in structured prediction (Kim and Rush, 2016; Kuncoro et al., 2016; Wang et al., 2020a, 2021b), leading to models that are both stronger and faster.

Conclusion
In this paper, we propose Automated Concatenation of Embeddings, which automatically searches for better embedding concatenation for structured prediction tasks. We design a simple search space and use the reinforcement learning with a novel reward function to efficiently guide the controller to search for better embedding concatenations. We take the change of embedding concatenations into the reward function design and show that our new reward function is stronger than the simpler ones. Results show that ACE outperforms strong baselines. Together with fine-tuned embeddings, ACE achieves state-of-the-art performance in 6 tasks over 21 datasets.

A Detailed Configurations
Evaluation To evaluate our models, we use F1 score to evaluate NER, Chunking and AE, use accuracy to evaluate POS Tagging, use unlabeled attachment score (UAS) and labeled attachment score (LAS) to evaluate DP, and use labeled F1 score to evaluate SDP.
Task Models and Controller For sequence-structured tasks (i.e., NER, POS tagging, chunking, aspect extraction), we use a batch size of 32 sentences and an SGD optimizer with a learning rate of 0.1. We anneal the learning rate by 0.5 when there is no accuracy improvement on the development set for 5 epochs. We set the maximum training epoch to 150. For graph-structured tasks (i.e., DP and SDP), we use Adam (Kingma and Ba, 2015) to optimize the model with a learning rate of 0.002. We anneal the learning rate by 0.75 for every 5000 iterations following Dozat and Manning (2017). We set the maximum training epoch to 300. For DP, we run the maximum spanning tree (McDonald et al., 2005) algorithm to output valid trees in testing. We fix the hyper-parameters of the task models.
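The plateau-based annealing used for the sequence-structured tasks can be illustrated with a short sketch. It assumes that "no improvement" means no strict increase of the best development score and that the patience counter resets after each decay; the function name and the score sequence are hypothetical.

```python
def anneal_on_plateau(dev_scores, lr=0.1, factor=0.5, patience=5):
    """Return the learning rate used at each epoch.

    The rate is multiplied by `factor` whenever the development score has not
    improved for `patience` consecutive epochs (assumption: the counter is
    reset after each decay).
    """
    best = float("-inf")
    stale = 0
    schedule = []
    for score in dev_scores:
        schedule.append(lr)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
    return schedule

# Two improving epochs followed by six epochs without improvement.
scores = [0.80, 0.85] + [0.84] * 6
schedule = anneal_on_plateau(scores)
print(schedule)
```

In this toy run the rate stays at 0.1 until five stale epochs have passed and is halved for the following epoch.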
We tune the learning rate for the controller among {0.1, 0.2, 0.3, 0.4, 0.5} and the discount factor among {0.1, 0.3, 0.5, 0.7, 0.9} on the same dataset as in Section 5.2. We search over these hyper-parameters with grid search and find that a learning rate of 0.1 and a discount factor of 0.5 perform best on the development set. The controller's parameters are all initialized to 0 so that each candidate is selected evenly in the first two time steps.
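To make the search grid and the zero initialization concrete, the sketch below enumerates the 25 hyper-parameter pairs and shows why all-zero controller parameters give every candidate the same selection probability. It assumes that each candidate's selection probability is the sigmoid of its parameter, which this appendix does not spell out; the helper `selection_probs` and the short candidate list are illustrative only.

```python
import itertools
import math

learning_rates = [0.1, 0.2, 0.3, 0.4, 0.5]
discount_factors = [0.1, 0.3, 0.5, 0.7, 0.9]

# Grid search visits every (learning rate, discount factor) pair: 5 x 5 = 25.
grid = list(itertools.product(learning_rates, discount_factors))


def selection_probs(theta):
    """Assumed parameterization: P(select candidate l) = sigmoid(theta_l)."""
    return [1.0 / (1.0 + math.exp(-t)) for t in theta]


# Illustrative subset of candidate embeddings, all parameters set to 0.
candidates = ["ELMo", "BERT", "M-BERT", "XLM-R", "Flair"]
theta = [0.0] * len(candidates)
probs = selection_probs(theta)

print(len(grid))
print(dict(zip(candidates, probs)))
```

Under this assumption, sigmoid(0) = 0.5 for every candidate, so no embedding is favoured before the controller has received any reward.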
We use Stochastic Gradient Descent (SGD) to optimize the controller. The training time depends on the task and dataset size. Take the CoNLL English NER dataset as an example. It takes 45 GPU hours to train the controller for 30 steps on a single Tesla P100 GPU, which is an acceptable training time in practice.

Sources of Embeddings
The sources of the embeddings that we used are listed in Table 7.

B.1 Document-Level and Sentence-Level Representations
Recently, models with document-level word representations extracted from transformer-based embeddings have significantly outperformed models with sentence-level word representations in NER (Devlin et al., 2019; Yu et al., 2020; Yamada et al., 2020). However, there are many application scenarios in which document contexts are unavailable. We therefore replace the document-level word representations from transformer-based embeddings (i.e., XLM-R and BERT embeddings) with sentence-level word representations. Results are shown in Table 8. We report the test results of All to show how the gap between ACE and All changes with different kinds of representations. We report the test accuracy of the models with the highest development accuracy, following Yamada et al. (2020), for a fair comparison. Empirical results show that document-level representations can significantly improve the accuracy of ACE. Compared with models using sentence-level representations, the average accuracy gap between ACE and All widens from 0.7 to 1.7 with document-level representations, which shows that the advantage of ACE becomes stronger with document-level representations.

B.2 Fine-tuned Models Versus ACE
To fine-tune the embeddings, we use the AdamW optimizer (Loshchilov and Hutter, 2018) with a learning rate of 5 × 10⁻⁶ and train the contextualized embeddings on the task for 10 epochs. We use a batch size of 32 for BERT and M-BERT and a batch size of 4 for XLM-R, RoBERTa, and XLNet. A comparison between ACE and the fine-tuned embeddings that we used in ACE is shown in Tables 9 and 10. Results show that ACE can further improve the accuracy of fine-tuned models.

B.3 Retraining
Most of the work in NAS (Zoph and Le, 2017; Zoph et al., 2018; Pham et al., 2018b; So et al., 2019; Zhu et al., 2020) retrains the searched neural architecture from scratch so that the hyper-parameters of the searched model can be modified or the model can be trained on larger datasets. To show whether our searched embedding concatenation is helpful to the task, we retrain the task model with the searched embedding concatenations on the same dataset from scratch. For the experiment, we use the same dataset settings as in Section 4.3.1. We train the searched embedding concatenation of each ACE run 3 times (therefore, 9 runs for each dataset). Table 12 compares All with the models retrained on the embedding concatenations searched by ACE. The results show that the retrained models are competitive with ACE in SDP and in chunking. However, in the other three tasks, the retrained models are inferior to ACE. A possible reason is that in ACE the model at each step is initialized from the trained model of the previous step. The retrained models outperform All in all tasks, which shows the effectiveness of the searched embedding concatenations.

B.4 Effect of Embeddings in the Searched Embedding Concatenations
There is no clear conclusion on which concatenation of embeddings is helpful to most of the tasks. We analyze the best embedding concatenations found by ACE across different structured outputs, semantic/syntactic types, and monolingual/multilingual tasks. The percentage of the best concatenations, over all ACE experiments, that select each embedding is shown in Table 13. The best embedding concatenation varies with the output structure, the syntactic/semantic level of understanding, and the language. The experimental results show that it is essential to select embeddings for each kind of task separately. However, we also find that some embeddings are strong in specific settings. Comparing the sequence-structured and graph-structured tasks, we find that M-BERT and ELMo are frequently selected only in sequence-structured tasks, while XLM-R embeddings are always selected in graph-structured tasks. For Flair embeddings, the forward and backward models are evenly selected. We suspect that one direction of Flair embeddings is strong enough, so concatenating the embeddings from both directions cannot further improve the accuracy. For non-contextualized embeddings, pretrained word embeddings are frequently selected in sequence-structured tasks, and character embeddings are not.
When we dig deeper into the semantic and syntactic types of these two structured outputs, we find that, in all best concatenations, BERT embeddings are selected in all syntactic sequence-structured tasks, and Flair, M-Flair, word, and XLM-R embeddings are selected in syntactic graph-structured tasks. In multilingual tasks, all best concatenations in multilingual NER tasks select M-BERT embeddings, while M-BERT is rarely selected in multilingual AE tasks. The monolingual Flair embeddings are always selected in NER tasks, and XLM-R is more frequently selected in multilingual tasks than in monolingual sequence-structured tasks (SS).
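The selection percentages discussed above can be computed with a simple count over the best concatenations. The sketch below is a minimal illustration, assuming each best concatenation is represented as a set of embedding names; the function `selection_percentages` and the four example runs are hypothetical and do not reproduce the values in Table 13.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def selection_percentages(best_concats: Iterable[Set[str]],
                          candidates: List[str]) -> Dict[str, float]:
    """Percentage of best concatenations that include each candidate."""
    concats = list(best_concats)
    if not concats:
        raise ValueError("need at least one concatenation")
    # Each set contributes at most one count per embedding name.
    counts = Counter(name for concat in concats for name in concat)
    return {name: 100.0 * counts[name] / len(concats) for name in candidates}


# Hypothetical search results, not taken from Table 13.
runs = [
    {"XLM-R", "Flair", "word"},
    {"XLM-R", "BERT"},
    {"XLM-R", "Flair", "M-BERT"},
    {"XLM-R", "ELMo", "word"},
]
pct = selection_percentages(runs, ["XLM-R", "Flair", "word", "char"])
print(pct)
```

Grouping the runs by task type (for example, sequence-structured versus graph-structured) before calling the function would give one column of such percentages per group.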