FAME: Feature-Based Adversarial Meta-Embeddings for Robust Input Representations

Combining several embeddings typically improves performance in downstream tasks as different embeddings encode different information. It has been shown that even models using embeddings from transformers still benefit from the inclusion of standard word embeddings. However, the combination of embeddings of different types and dimensions is challenging. As an alternative to attention-based meta-embeddings, we propose feature-based adversarial meta-embeddings (FAME) with an attention function that is guided by features reflecting word-specific properties, such as shape and frequency, and show that this is beneficial for handling subword-based embeddings. In addition, FAME uses adversarial training to optimize the mappings of differently-sized embeddings to the same space. We demonstrate that FAME works effectively across languages and domains for sequence labeling and sentence classification, in particular in low-resource settings. FAME sets the new state of the art for POS tagging in 27 languages, various NER settings and question classification in different domains.


Introduction
Recent work on word embeddings and pre-trained language models has shown the large impact of language representations on natural language processing (NLP) models across tasks and domains (Devlin et al., 2019; Beltagy et al., 2019; Conneau et al., 2020). Nowadays, a large number of embedding models are available with different characteristics, such as different input granularities (word-based (e.g., Mikolov et al., 2013; Pennington et al., 2014) vs. subword-based (e.g., Heinzerling and Strube, 2018; Devlin et al., 2019) vs. character-based (e.g., Lample et al., 2016; Ma and Hovy, 2016; Peters et al., 2018)) or different data used for pre-training (general domain vs. specific domains). Since those characteristics directly influence when embeddings are most effective, combinations of different embedding models are likely to be beneficial (Tsuboi, 2014; Kiela et al., 2018; Lange et al., 2019b), even when using already powerful large-scale pre-trained language models (Akbik et al., 2018; Yu et al., 2020). Word-based embeddings, for instance, are strong in modeling frequent words, while character-based embeddings can model out-of-vocabulary words. Similarly, domain-specific embeddings can capture in-domain words that do not appear in general domains like news text.
Different word representations can be combined using so-called meta-embeddings. Several methods are available, ranging from concatenation (e.g., Yin and Schütze, 2016) and averaging (e.g., Coates and Bollegala, 2018) to attention-based meta-embeddings (Kiela et al., 2018). However, they all come with shortcomings: Concatenation leads to high-dimensional input vectors and, as a result, requires additional parameters in the first layer of the neural network. Averaging simply merges all information into one vector, not allowing the network to focus on specific embedding types that might be more effective than others for representing the current word. Attention-based meta-embeddings address this problem by allowing dynamic combinations of embeddings depending on the current input token. However, the calculation of attention weights requires the model to assess the quality of the embeddings for a specific word. This is arguably very challenging when embeddings of different input granularities are combined, e.g., subwords and words. Infrequent in-domain tokens, for instance, are hard to detect when using subword-based embeddings, as they can model any token. Moreover, both averaging- and attention-based meta-embeddings require a mapping of all embeddings into the same space, which can be challenging for a set of embeddings with different dimensions.
In this paper, we propose feature-based adversarial meta-embeddings (FAME) that (1) align the embedding spaces with adversarial training, and (2) combine embeddings with an attention layer that is guided by features reflecting word-specific properties, such as the shape or frequency of the word, and can thus help the model assess the quality of the different embeddings. By using attention, we avoid the shortcomings of concatenation (high-dimensional input vectors) and averaging (merging information without focus). Further, our contributions mitigate the challenges of previous attention-based meta-embeddings: In our analysis, we show that the first contribution is especially beneficial when embeddings of different dimensions are combined. The second helps, in particular, when combining word-based with subword-based embeddings.
We conduct experiments across a variety of tasks, languages and domains, including sequence-labeling tasks (named entity recognition (NER) for four languages, concept extraction for two specialized domains (clinical and materials science), and part-of-speech (POS) tagging for 27 languages) and sentence classification tasks (question classification in different domains). Our results and analyses show that FAME outperforms existing meta-embedding methods and that even powerful fine-tuned transformer models can benefit from additional embeddings using our method. In particular, FAME sets the new state of the art for POS tagging in all 27 languages, for NER in two languages, as well as on all tested concept extraction and two question classification datasets.
In summary, our contributions are meta-embeddings with (i) adversarial training and (ii) a feature-based attention function. (iii) We perform broad experiments, ablation studies and analyses which demonstrate that our method is highly effective across tasks, domains and languages, including low-resource settings. (iv) Moreover, we show that even representations from large-scale pre-trained transformer models can benefit from our meta-embeddings approach. The code for FAME is publicly available 1 and compatible with the flair framework (Akbik et al., 2018).

Related Work
This section surveys related work on meta-embeddings, attention and adversarial training.
Meta-Embeddings. Previous work has seen performance gains by, for example, combining various types of word embeddings (Tsuboi, 2014) or the same type trained on different corpora (Luo et al., 2014). For the combination, several alternatives have been proposed, such as different input channels of a convolutional neural network (Kim, 2014; Zhang et al., 2016), concatenation followed by dimensionality reduction (Yin and Schütze, 2016) or averaging of embeddings (Coates and Bollegala, 2018), e.g., for combining embeddings from multiple languages (Lange et al., 2020b; Reid et al., 2020). More recently, auto-encoders (Bollegala and Bao, 2018; Wu et al., 2020), ensembles of sentence encoders (Poerner et al., 2020) and attention-based methods (Kiela et al., 2018; Lange et al., 2019a) have been introduced. The latter allow a dynamic (input-based) combination of multiple embeddings. Winata et al. (2019) and Priyadharshini et al. (2020) used similar attention functions to combine embeddings from different languages for NER in code-switching settings. Liu et al. (2021) explored the inclusion of domain-specific semantic structures to improve meta-embeddings in non-standard domains. In this paper, we follow the idea of attention-based meta-embeddings and propose task-independent methods for improving them.
Extended Attention. Attention was introduced in the context of machine translation (Bahdanau et al., 2015) and has since been widely used in NLP (i.a., Tai et al., 2015; Xu et al., 2015; Yang et al., 2016; Vaswani et al., 2017). Our approach extends this technique by integrating word features into the attention function. This is similar to extending the source of attention for uncertainty detection (Adel and Schütze, 2017) or relation extraction (Zhang et al., 2017b; Li et al., 2019). However, in contrast to these works, we use task-independent features derived from the token itself. Thus, we can use the same attention function for different tasks.
Adversarial Training. Further, our method is motivated by the use of adversarial training (Goodfellow et al., 2014) for creating input representations that are independent of a specific domain or feature. This is related to using adversarial training for domain adaptation (Ganin et al., 2016) or for coping with bias and confounding variables (Li et al., 2018; Raff and Sylvester, 2018; Zhang et al., 2018; Barrett et al., 2019; McHardy et al., 2019). Following Ganin et al. (2016), we use gradient reversal training in this paper.

Feature-Based Attention
FAME guides its attention function with the following word-specific features (see the illustrative sketch after the list):
-Frequency: High-frequency words can typically be modeled well by word-based embeddings, while low-frequency words are better captured with subword-based embeddings. Moreover, frequency is domain-dependent and can thus help to decide between embeddings from different domains. We estimate the frequency n of a word in the general domain from its rank r in the FastText-based embeddings provided by Grave et al. (2018): n(r) = k/r with k = 0.1, following Manning and Schütze (1999). Finally, we group the words into 20 bins as in Mikolov et al. (2011) and represent their frequency with a 20-dimensional one-hot vector.
-Word Shape: Word shapes capture certain linguistic features and are often part of manually designed feature sets, e.g., for CRF classifiers (Lafferty et al., 2001). For example, uncommon word shapes can be indicators for domain-specific words, which can benefit from domain-specific embeddings. We create 12 binary features that capture information on the word shape, including whether the first, any or all characters are uppercased, alphanumerical, digits or punctuation marks.
-Word Shape Embeddings: In addition, we train word shape embeddings (25 dimensions) similar to Limsopatham and Collier (2016). For this, the shape of each word is converted by replacing letters with c or C (depending on the capitalization), digits with n and punctuation marks with p. For instance, Dec. 12th would be converted to Cccp nncc. The resulting shapes are one-hot encoded and a trainable randomly initialized linear layer is used to compute the shape representation.
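To illustrate how these features can be computed, the following Python sketch derives the frequency bin and the word-shape string described above. The helper names and the exact binning scheme are illustrative assumptions rather than the paper's implementation; the paper only states that 20 frequency bins and 12 binary shape features are used.

```python
import math
import string

def frequency_bin(rank: int, k: float = 0.1, num_bins: int = 20) -> list:
    """One-hot frequency feature: estimate n(r) = k / r from the word's rank in the
    pre-trained embedding vocabulary and map it to one of 20 bins.
    The magnitude-based binning below is an assumption; the paper only states 20 bins."""
    freq = k / max(rank, 1)
    bin_idx = min(num_bins - 1, int(-math.log10(freq)))  # larger rank -> higher bin
    one_hot = [0] * num_bins
    one_hot[bin_idx] = 1
    return one_hot

def word_shape(token: str) -> str:
    """Shape string as described in the paper: letters -> c/C, digits -> n,
    punctuation -> p. Example: 'Dec. 12th' -> 'Cccp nncc'."""
    out = []
    for ch in token:
        if ch.isalpha():
            out.append("C" if ch.isupper() else "c")
        elif ch.isdigit():
            out.append("n")
        elif ch in string.punctuation:
            out.append("p")
        else:
            out.append(ch)  # keep whitespace and other symbols as-is
    return "".join(out)

def shape_flags(token: str) -> list:
    """A subset of the 12 binary shape features (illustrative, not exhaustive)."""
    return [
        int(token[:1].isupper()),                           # first character uppercased
        int(any(c.isupper() for c in token)),               # any character uppercased
        int(token.isupper()),                               # all characters uppercased
        int(any(c.isdigit() for c in token)),               # any digit
        int(token.isdigit()),                               # all digits
        int(any(c in string.punctuation for c in token)),   # any punctuation mark
    ]

print(word_shape("Dec. 12th"))  # prints: Cccp nncc
```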
All sparse feature vectors (binary or one-hot encoded) are fed through a linear layer to generate a dense representation. Finally, all features are concatenated into a single feature vector f of 77 dimensions, which is used to guide the attention function when combining the embeddings.
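The sketch below illustrates how such a feature vector can guide the combination of embeddings: each embedding is mapped to a common dimension E and scored together with the feature vector f, and the softmax-normalized scores weight the sum. The additive scoring function is an assumption in the spirit of attention-based meta-embeddings (Kiela et al., 2018), not a verbatim reproduction of FAME's formula; the dimensions follow the hyperparameters reported later (E = 1024, attention hidden size 10, 77 feature dimensions).

```python
import torch
import torch.nn as nn

class FeatureGuidedMetaEmbedding(nn.Module):
    """Maps each input embedding into a common space and combines them with attention
    weights that also depend on word-level features. The scoring function is assumed."""

    def __init__(self, embed_dims, feat_dim=77, common_dim=1024, hidden=10):
        super().__init__()
        # Q: one mapping per embedding type into the common embedding space
        self.mappings = nn.ModuleList([nn.Linear(d, common_dim) for d in embed_dims])
        self.score = nn.Sequential(
            nn.Linear(common_dim + feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, embeddings, features):
        # embeddings: list of tensors [batch, seq, d_i]; features: [batch, seq, feat_dim]
        mapped = torch.stack([m(e) for m, e in zip(self.mappings, embeddings)], dim=2)
        feats = features.unsqueeze(2).expand(-1, -1, mapped.size(2), -1)
        scores = self.score(torch.cat([mapped, feats], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)              # weights over embedding types
        return (alpha.unsqueeze(-1) * mapped).sum(dim=2)   # [batch, seq, common_dim]
```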

Adversarial Learning of Mappings
The attention-based meta-embeddings require that all embeddings have the same dimension for summation. For this, mapping matrices need to be learned, as only a limited number of embeddings exist for many languages and domains, and there is typically no option to use only embeddings of the same size. To learn an effective mapping, we propose to use adversarial training. In particular, FAME adapts gradient-reversal training with three components: the representation module R consisting of the different embedding models and the mapping functions Q to the common embedding space, a discriminator D that tries to distinguish the different embeddings from each other, and a downstream classifier C, which is either a sequence tagger or a sentence classifier in our experiments (and is described in more detail in Section 4). The input representation is shared between the discriminator and the downstream classifier and trained with gradient reversal to fool the discriminator. To be more specific, the discriminator D is a multinomial non-linear classification model with a standard cross-entropy loss function L_D. In our sequence tagging experiments, the downstream classifier C has a conditional random field (CRF) output layer and is trained with a CRF loss L_C to maximize the log probability of the correct tag sequence (Lample et al., 2016). In our sentence classification experiments, C is a multinomial classifier with cross-entropy loss L_C. Let θ_R, θ_D, θ_C be the parameters of the representation module, discriminator and downstream classifier, respectively. Gradient reversal training updates the parameters as follows:

θ_C ← θ_C − η ∂L_C/∂θ_C
θ_D ← θ_D − η ∂L_D/∂θ_D
θ_R ← θ_R − η (∂L_C/∂θ_R − λ ∂L_D/∂θ_R)

with η being the learning rate and λ being a hyperparameter to control the discriminator influence.
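A minimal PyTorch sketch of the gradient reversal mechanism (following Ganin et al., 2016) is shown below: the forward pass is the identity, while the backward pass flips the sign of the discriminator gradient and scales it by λ, so that the representation module R is trained to fool the discriminator D. Class and variable names are illustrative, not taken from the paper's code.

```python
import torch
from torch.autograd import Function

class GradientReversal(Function):
    """Identity in the forward pass; multiplies the incoming gradient by -lambda in the
    backward pass, yielding the reversed update for the representation module."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # no gradient for the lambda argument

def reverse_gradient(x, lam=1e-5):
    return GradientReversal.apply(x, lam)

# Sketch of one adversarial step (run every 10th batch in the paper's setup):
# shared = representation_module(batch)                 # module R with mappings Q
# d_logits = discriminator(reverse_gradient(shared, lam))
# loss_D = torch.nn.functional.cross_entropy(d_logits, embedding_type_labels)
# loss_D.backward()   # updates D normally; R receives the reversed, scaled gradient
```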

Neural Architectures
In this section, we present the architectures we use for text classification and sequence tagging. Note that our contribution concerns the input representation layer, which can be used with any NLP model, e.g., also sequence-to-sequence models.

Input Layer
The input to our neural networks is our FAME meta-embeddings layer as described in Section 3. Our methodology does not depend on the embedding method, i.e., it can incorporate any token representation. In our experiments, we use the embeddings listed in Table 1, e.g., domain-specific embeddings (Pyysalo et al., 2013) and Spanish embeddings (Gutiérrez-Fandiño et al., 2021). For all experiments, our baselines and proposed models use the same set of embeddings. We experiment with both freezing and fine-tuning the transformer embeddings during training. However, note that fine-tuning the transformer model increases the model size by more than a factor of 100, from 4M trainable parameters to 535M, as shown in Table 2. This increases computational costs by a large margin. For example, in our experiments, the time for training a single epoch for English NER increases from 3 to 38 minutes.

Model for Sequence Tagging
Our sequence tagger follows a well-known architecture (Lample et al., 2016) with a bidirectional long short-term memory (BiLSTM) network and a conditional random field (CRF) output layer (Lafferty et al., 2001). Note that we perform sequence tagging at the sentence level without cross-sentence context, as done, i.a., by Schweter and Akbik (2020).
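For illustration, a compact sketch of such a tagger on top of the FAME input layer is given below, assuming the pytorch-crf package for the CRF layer; this is a sketch under those assumptions, not the paper's exact implementation.

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (assumed dependency for this sketch)

class BiLSTMCRFTagger(nn.Module):
    """Sequence tagger sketch: FAME meta-embeddings -> BiLSTM -> CRF."""

    def __init__(self, input_dim, num_tags, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, token_reprs, tags=None, mask=None):
        # token_reprs: [batch, seq, input_dim] coming from the FAME input layer
        scores = self.emissions(self.lstm(token_reprs)[0])
        if tags is not None:
            return -self.crf(scores, tags, mask=mask)  # CRF negative log-likelihood
        return self.crf.decode(scores, mask=mask)      # best tag sequence per sentence
```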

Models for Text Classification
For sentence classification, we use a BiLSTM sentence encoder. The resulting sentence representation is fed into a linear layer followed by a softmax activation that outputs label probabilities.
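A minimal sketch of this classifier is shown below; using the concatenated final hidden states of the two LSTM directions as the sentence representation is an assumption, as the pooling strategy is not specified in this excerpt.

```python
import torch
import torch.nn as nn

class BiLSTMSentenceClassifier(nn.Module):
    """Sentence classifier sketch: BiLSTM encoder over FAME token representations,
    followed by a linear layer and a softmax over the labels."""

    def __init__(self, input_dim, num_labels, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_reprs):
        _, (h_n, _) = self.lstm(token_reprs)                  # h_n: [2, batch, hidden]
        sentence_repr = torch.cat([h_n[0], h_n[1]], dim=-1)   # [batch, 2 * hidden]
        return torch.softmax(self.out(sentence_repr), dim=-1)  # label probabilities
```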

Hyperparameters and Training
To ensure reproducibility, we describe details of our models and training procedure in the following.
Hyperparameters. We use hidden sizes of 256 units per direction for all BiLSTMs. The attention layer has a hidden size H of 10. We set the mapping size E to the size of the largest embedding in all experiments, i.e., 1024 dimensions, the size of XLM-R embeddings. The discriminator D has a hidden size of 1024 units and is trained every 10th batch. We perform a hyperparameter search for the λ parameter in {1e-4, 1e-5, 1e-6, 1e-7} for models using adversarial training. Note that we use the same hyperparameters for all models and all tasks.
Training. We use the AdamW optimizer with an initial learning rate of 5e-6. We train the models for a maximum of 100 epochs and select the best model according to the task's metric on the development set if available, or according to the training loss otherwise. The training was performed on Nvidia Tesla V100 GPUs with 32GB VRAM. 3

Experiments and Results
We now describe the tasks and datasets we use in our experiments as well as our results.

Tasks and Datasets
Sequence Labeling. For sequence labeling, we use named entity recognition (NER) and part-of-speech (POS) tagging. For NER, we use benchmark datasets from the news domain (English, German, Dutch, Spanish) (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003). In addition, we conduct experiments for concept extraction on two datasets from the clinical domain, among them the English i2b2 2010 data (Uzuner et al., 2011).

Evaluation Results
We now present the results of our experiments. All reported numbers are the averages of three runs.
Sequence Labeling. Tables 3 and 4 show the results for sequence labeling in comparison to the state of the art. 4 Our models consistently set the new state of the art for English and Dutch NER, for domain-specific concept extraction as well as for all 27 languages for POS tagging. The comparison with XLM-R on NER shows that our FAME method can also improve upon already powerful transformer representations. In domain-specific concept extraction, we outperform related work by 1.5 F1 points on average. This shows that our approach works across languages and domains.
Sentence Classification. Similar to sequence labeling, our FAME approach outperforms the existing machine learning models on all three tested sentence classification datasets, as shown in Table 6. This demonstrates that our approach is generally applicable and can be used for different tasks beyond the token level. 5

Analysis
We finally analyze the different components of our proposed FAME model by investigating, i.a., ablation studies, attention weights and low-resource settings.

Table 5: Ablation study results for sequence labeling. We underline our FAME models without fine-tuning for which we found statistically significant differences to the attention-based meta-embeddings (ATT).

Ablation Study on Model Components
Table 5 provides an ablation study on the different components of our FAME model for exemplary sequence-labeling tasks. First, we ablate the fine-tuning of the embedding models, as we found that this has a large impact on the number of parameters of our models (538M vs. 4M) and, as a result, on the training time (cf. Section 4.1). Our results show that fine-tuning has a positive impact on the performance of our models, but our approach still works very well with frozen embeddings. In particular, our non-fine-tuned FAME model is competitive with a fine-tuned XLM-R model (see Table 3) and outperforms it on 3 out of 4 languages for NER.
Second, we ablate our two newly introduced components (features and adversarial training) and find that both of them have a positive impact on the performance of our models across tasks, languages and domains.
By successively removing components, we obtain models that correspond to the baseline meta-embeddings shown in the second column of the table. Our method without features and adversarial training, for example, corresponds to the baseline attention-based meta-embedding approach (ATT). Further removing the attention function yields average-based meta-embeddings (AVG). Finally, we also evaluate another baseline meta-embedding alternative, namely concatenation (CAT). Note that concatenation leads to a very high-dimensional input representation and, therefore, requires more parameters in the next neural network layer, which can be inefficient in practice.
Statistical Significance. To show that FAME significantly improves upon the attention-based meta-embeddings, we report statistical significance 6 between those two models (using our method without fine-tuning for a fair comparison). Table 5 shows that we find statistically significant differences in six out of ten settings.

Influence of Embedding Granularities and Dimensions
Next, we perform an analysis to show the effects of our method for embeddings of different dimensions and granularities and support our motivation that our contributions help in those settings. As a testbed, we perform Spanish concept extraction and utilize the embeddings published by Grave et al. (2018) and Gutiérrez-Fandiño et al. (2021), as they allow us to isolate the desired effects. In particular, they published pairs of embeddings (all having 300 dimensions) that were trained on the same corpora. The first are standard word embeddings and the second are subword-based embeddings with out-of-vocabulary functionality. As both were trained on the same data, we can isolate the effect of embedding granularities in a first experiment. In addition, Gutiérrez-Fandiño et al. (2021) published smaller versions with 100 dimensions that were trained under the same conditions. We use those in a second experiment to analyze the effects of combining embeddings of different dimensions.
The results are shown in Table 7. We find that adversarial training becomes particularly important whenever differently-sized embeddings are combined, i.e., when the model needs to learn mappings to higher dimensions.