GiBERT: Enhancing BERT with Linguistic Information using a Lightweight Gated Injection Method

Large pre-trained language models such as BERT have been the driving force behind recent improvements across many NLP tasks. However, BERT is only trained to predict missing words – either through masking or next sentence prediction – and has no knowledge of lexical, syntactic or semantic information beyond what it picks up through unsupervised pre-training. We propose a novel method to explicitly inject linguistic information in the form of word embeddings into any layer of a pre-trained BERT. When injecting counter-fitted and dependency-based embeddings, the performance improvements on multiple semantic similarity datasets indicate that such information is beneficial and currently missing from the original model. Our qualitative analysis shows that counter-fitted embedding injection is particularly beneficial, with notable improvements on examples that require synonym resolution.


Introduction
Detecting the semantic similarity between a given text pair is at the core of many NLP tasks. It is a challenging problem due to the inherent variability of language and the limitations of surface form similarity. Recent pre-trained language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have led to noticeable improvements in semantic similarity detection and subsequent work has explored how these architectures can be further improved. One line of work aims at model compression, making BERT smaller and accessible while mostly preserving its performance (Xu et al., 2020; Goyal et al., 2020; Sanh et al., 2019; Aguilar et al., 2020; Lan et al., 2020; Chen et al., 2020).
Other studies seek to further improve model performance by enhancing BERT with external information from knowledge bases (Peters et al., 2019; Wang et al., 2020) or additional modalities (Lu et al., 2019; Lin et al., 2020).
Before the rise of contextualised models, transfer of pre-trained information between datasets and tasks in NLP was based on word embeddings. Over many years, substantial effort was placed into the creation of such embeddings. While originally capturing mainly collocation patterns (Mikolov et al., 2013; Pennington et al., 2014), subsequent work enriched these embeddings with additional information, such as dependencies (Levy and Goldberg, 2014), subword information (Luong et al., 2013; Bojanowski et al., 2017) and semantic lexicons (Faruqui et al., 2015). As a result, there exists a wealth of pre-trained embedding resources for many languages in a unified format which could provide complementary information for contemporary pre-trained contextual models. Moreover, aligning contextual embeddings with static embeddings has been shown to increase the performance of the former (Liu et al., 2020).
We propose a new method for injecting pretrained linguistically-enriched embeddings into any layer of BERT. The model maps any word embeddings into the same space as BERT's hidden representations, then combines them using learned gating parameters. Evaluation of this method on five semantic similarity tasks shows that injecting pre-trained dependency-based and counter-fitted embeddings can further enhance BERT's performance. More specifically, we make the following contributions:

1. We propose GiBERT – a lightweight gated method for injecting externally pre-trained embeddings into BERT (section 3.1).

2. We provide an ablation study and a detailed analysis of the components in the injection architecture (section 5).
3. We demonstrate that the proposed model improves BERT's performance on multiple semantic similarity detection datasets. In comparison to multi-head attention injection, our gated injection method uses fewer parameters while achieving comparable performance for dependency embeddings and improved results for counter-fitted embeddings (section 5).

Our qualitative analysis provides insights into GiBERT's improved performance, such as in cases of sentence pairs involving synonyms (section 5).
Related Work

BERT modifications Due to BERT's widespread success in NLP, recent studies have focused on further improving BERT by introducing external information. Such work covers a variety of application areas and technical approaches. We broadly categorise such approaches into input-related, output-related and internal. Input modifications (Zhao et al., 2020; Singh et al., 2020; Lai et al., 2020; Ruan et al., 2020) adapt the information that is fed to BERT – e.g. feeding text triples separated by [SEP] tokens instead of sentence pairs as in Lai et al. (2020) – while leaving the architecture unchanged. Output modifications (Xuan et al., 2020; Zhang et al., 2020) build on BERT's pre-trained representation by adding external information after the encoding step – e.g. combining it with additional semantic information as in Zhang et al. (2020) – without changing BERT itself. By contrast, internal modifications introduce new information directly into BERT by adapting its internal architecture. Fewer studies have taken this approach, as it is technically more difficult and might increase the risk of so-called catastrophic forgetting – completely forgetting previous knowledge when learning new tasks (French, 1999; Wen et al., 2018). However, such modifications also offer the opportunity to directly harness BERT's powerful architecture to process the external information alongside the pretrained one. Most existing work on internal modifications has attempted to combine BERT's internal representation with visual and knowledge base information: Lu et al. (2019) modified BERT's transformer block with co-attention to integrate visual and textual information, while Lin et al. (2020) introduced a multimodal model which uses multi-head attention to integrate encoded image and text information between each transformer block. Peters et al. (2019) suggested a word-to-entity attention mechanism to incorporate external knowledge into BERT and Wang et al.
(2020) proposed to inject factual and linguistic knowledge through separate adapter modules. Our method introduces external information with an addition-based mechanism which uses fewer parameters than existing attention-based techniques (Lu et al., 2019; Lin et al., 2020; Peters et al., 2019). We further incorporate a gating mechanism to scale injected information so as to reduce the risk of catastrophic forgetting. Moreover, our work investigates the injection of pretrained word embeddings, rather than multimodal or knowledge base information as in previous studies.
Semantic similarity detection Semantic similarity detection is a framework for binary text pair classification tasks such as paraphrase detection, duplicate question identification and answer sentence selection, which require detecting the semantic similarity between text pairs (Peinelt et al., 2020). Early semantic similarity methods used feature-engineering techniques, exploring various syntactic (Filice et al., 2017), semantic (Balchev et al., 2016) and lexical features (Tran et al., 2015; Almarwani and Diab, 2017). Subsequent work tried to model text pair relationships either based on increasingly complex neural architectures (Deriu and Cieliebak, 2017; Wang et al., 2017; Tan et al., 2018) or by combining both approaches through hybrid techniques (Wu et al., 2017a; Feng et al., 2017; Koreeda et al., 2017). Most recently, contextual models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have reached state-of-the-art performance through pretraining large context-aware language models on vast amounts of textual data. Our study joins up earlier lines of work with current state-of-the-art contextual representations by investigating the combination of BERT with dependency-based and counter-fitted embeddings.

Architecture
We propose GiBERT – a Gated Injection Method for BERT. Our model (Figure 1) is designed with semantic similarity detection in mind and comprises the following steps: obtaining BERT's intermediate representation from Transformer block i (steps 1-2 in Figure 1), creating an alternative input representation based on linguistically-enriched word embeddings (steps 3-4), combining both representations (steps 5-7) and passing on the injected information to subsequent BERT layers to make a final prediction (steps 8-9).
BERT representation We encode a sentence pair with a pre-trained BERT model (Devlin et al., 2019) and obtain BERT's internal representation at different layers (see section 5 for injection layer choices). Following standard practice, we process the two input sentences S_1 and S_2 with a word piece tokenizer (Wu et al., 2017b) and combine them using '[CLS]' and '[SEP]' tokens, which indicate sentence boundaries. The word pieces are then mapped to ids, resulting in a sequence of word piece ids E_W = [w_1, ..., w_N], where N indicates the number of word pieces in the sequence (step 1 in Figure 1). In the case of embedding layer injection, we use BERT's embedding layer output H_0 ∈ R^{N×D}, which results from summing the word piece embeddings E_W, positional embeddings E_P and segment embeddings E_S (step 2):

H_0 = E_W + E_P + E_S

where D is the internal hidden size of BERT (D = 768 for BERT_BASE). For injecting information at later layers, we obtain BERT's internal representation H_i ∈ R^{N×D} after transformer block i with 1 ≤ i ≤ L (step 2):

H_i = TransformerBlock_i(H_{i-1})

where L is the number of Transformer blocks (L = 12 for BERT_BASE) and each Transformer block consists of multihead attention (MultiheadAtt) followed by a position-wise feed-forward layer, each wrapped in a residual connection and layer normalisation.
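The embedding-layer sum above can be sketched in a few lines of numpy. This is only a shape-level illustration with random placeholder values: the real model also applies layer normalisation and dropout, and the embedding tables are learned, not sampled.

```python
import numpy as np

N, D = 8, 768  # sequence length and hidden size (BERT-BASE)
rng = np.random.default_rng(0)

# Illustrative lookups: word piece, positional and segment embeddings,
# one row per token in the word piece sequence.
E_W = rng.normal(size=(N, D))  # word piece embeddings
E_P = rng.normal(size=(N, D))  # positional embeddings
E_S = rng.normal(size=(N, D))  # segment (sentence A/B) embeddings

# Embedding-layer output used for embedding-layer injection (step 2).
H_0 = E_W + E_P + E_S
assert H_0.shape == (N, D)
```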
External embedding representation To enrich this representation, we obtain alternative representations for the tokens in S_1 and S_2 by looking up word embeddings in a pre-trained embedding matrix E ∈ R^{|V|×E}, where |V| denotes the vocabulary size and E the dimensionality of the pre-trained embeddings (step 3; section 3.2 presents details regarding our choice of pre-trained embeddings).
In order to map word embedding representations to BERT's word piece representations, an alignment function duplicates the word embedding for the corresponding number of subwords, then adds BERT's special '[CLS]' and '[SEP]' tokens, resulting in an injection sequence I ∈ R^{P×E} (step 4). For example, it assigns the pre-trained embedding of the word 'prompt' to both of the corresponding word pieces 'pro' and '##mpt' (see Figure 1).
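A possible implementation of this alignment step is sketched below. The function name and the handling of the special tokens are our own illustration – the paper does not specify which vector is injected at the '[CLS]'/'[SEP]' positions, so we pass one in explicitly.

```python
def align_to_word_pieces(words, word_pieces, embedding_lookup, special_vec):
    """Repeat each word's embedding for all of its word pieces and use
    `special_vec` at '[CLS]'/'[SEP]' positions (an assumption; the paper
    does not specify how special tokens are embedded)."""
    injected = []
    word_iter = iter(words)
    current = None
    for piece in word_pieces:
        if piece in ("[CLS]", "[SEP]"):
            injected.append(special_vec)
        elif not piece.startswith("##"):
            # First piece of a new word: look up its pre-trained embedding.
            current = embedding_lookup(next(word_iter))
            injected.append(current)
        else:
            # Continuation piece: reuse the embedding of its word.
            injected.append(current)
    return injected

# 'prompt' -> 'pro', '##mpt': both pieces receive the same vector.
lookup = {"a": [1.0], "prompt": [2.0], "reply": [3.0]}.__getitem__
pieces = ["[CLS]", "a", "pro", "##mpt", "reply", "[SEP]"]
I = align_to_word_pieces(["a", "prompt", "reply"], pieces, lookup, [0.0])
assert I == [[0.0], [1.0], [2.0], [2.0], [3.0], [0.0]]
```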

Attention injection
Multihead attention was proposed by Vaswani et al. (2017):

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where the queries Q are provided by BERT's internal representation, while the keys K and values V come from the injected embeddings. The output of the attention mechanism is then combined with the previous layer through addition:

H'_i = H_i + MultiheadAtt(H_i, I, I)
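A single-head numpy sketch of attention injection under these roles (queries from H_i, keys and values from the injected embeddings I) is shown below; biases, the multi-head split and layer normalisation are omitted, and all weights here are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_injection(H, I, W_Q, W_K, W_V, W_O):
    """Single-head sketch: queries from BERT's representation H,
    keys/values from the injected embeddings I, with a residual
    connection adding the attention output back to H."""
    Q = H @ W_Q                                      # (N, D) -> (N, d)
    K = I @ W_K                                      # (P, E) -> (P, d)
    V = I @ W_V
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # (N, P) attention
    return H + scores @ V @ W_O                      # residual addition

rng = np.random.default_rng(1)
N, P, D, E, d = 6, 6, 16, 8, 4
H, I = rng.normal(size=(N, D)), rng.normal(size=(P, E))
out = attention_injection(H, I,
                          rng.normal(size=(D, d)), rng.normal(size=(E, d)),
                          rng.normal(size=(E, d)), rng.normal(size=(d, D)))
assert out.shape == (N, D)
```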

Gated injection
We also propose an alternative method for combining external embeddings with BERT which requires only 14% of the parameters used in multi-head attention (0.23M instead of 1.64M, see Appendix G). First, we add a feed-forward layer – consisting of a linear layer with weights W_P ∈ R^{D×E} and bias b_P ∈ R^D with a tanh activation function – to project the aligned embedding sequence to BERT's internal dimensions and squash the output values to a range between -1 and 1 (step 5):

P = tanh(I W_P^T + b_P)

Figure 1: The input consists of a sentence pair which is processed with a word piece tokenizer (step 1) and encoded with BERT up to layer i (step 2). We obtain an alternative representation for the sentences based on pretrained word embeddings (step 3), while ensuring that external word embeddings are aligned with BERT's word pieces by repeating embeddings for tokens which have been split into several word pieces (step 4). The aligned word embedding sequence is passed through a linear and tanh layer to match BERT's embedding dimension (step 5). We apply a gating mechanism (step 6) before adding the injected information to BERT's representation from layer i (step 7). The combined representation is passed to the next layer (step 8). At the final layer, the C vector is used as the sentence pair representation, followed by a classification layer (step 9).
Then, we use a residual connection to inject the projected external information into BERT's representation from Transformer block i (see section 5 for injection at different locations) and obtain a new enriched representation H'_i ∈ R^{N×D}:

H'_i = H_i + P

However, injection values in P can range between -1 and 1, whereas values in BERT's internal representation H_i usually range from -0.1 to 0.1. When external information is directly injected using an additive operation, BERT's pre-trained information can easily be overwritten by the injection, resulting in catastrophic forgetting. To address this potential pitfall, we further propose a gating mechanism which uses a gating vector g ∈ R^D to scale the injected information before combining it with BERT's internal representation as follows:

H'_i = H_i + g ⊙ P

where ⊙ denotes element-wise multiplication using broadcasting (steps 6 & 7). The gating parameters are initialised with zeros and updated during training. This has the benefit of starting fine-tuning from representations which are equivalent to vanilla BERT and gradually introducing the injected information along certain dimensions. If specific features in the external representations are not beneficial for the task, it is easy for the model to ignore them by keeping the gating parameters at zero.
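Steps 5-7 can be sketched in a few lines of numpy. The weights below are random placeholders; the key property illustrated is that with the gate initialised to zeros the output is exactly BERT's original representation, so fine-tuning starts from vanilla BERT.

```python
import numpy as np

def gated_injection(H_i, I, W_P, b_P, g):
    """Gated injection (steps 5-7): project the aligned embeddings to
    BERT's hidden size, squash with tanh, scale per dimension by the
    gating vector g, and add to BERT's representation."""
    P = np.tanh(I @ W_P.T + b_P)   # (N, E) -> (N, D), values in (-1, 1)
    return H_i + g * P             # g broadcasts over the sequence axis

N, D, E = 6, 16, 8
rng = np.random.default_rng(2)
H_i, I = rng.normal(size=(N, D)), rng.normal(size=(N, E))
W_P, b_P = rng.normal(size=(D, E)), np.zeros(D)

# With g initialised to zeros, the output equals vanilla BERT's
# representation -- injection is introduced gradually during training.
g = np.zeros(D)
assert np.allclose(gated_injection(H_i, I, W_P, b_P, g), H_i)
```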
Output layer The combined representation H'_i is then fed to BERT's next Transformer block i + 1 (step 8). At the final Transformer block L, we use the c ∈ R^D vector which corresponds to the '[CLS]' token in the input and is typically used as the sentence pair representation (step 9). As proposed by Devlin et al. (2019), this is followed by a softmax classification layer (with weights W_L ∈ R^{C×D} and bias b_L ∈ R^C) to calculate class probabilities, where C indicates the number of classes:

p = softmax(W_L c + b_L)

During fine-tuning, we train the entire model for 3 epochs with early stopping and cross-entropy loss. Learning rates are tuned for each seed and dataset based on development set performance (reported in Appendix D).
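The classification step is a standard softmax over logits computed from the '[CLS]' vector; a minimal sketch with placeholder weights:

```python
import numpy as np

def classify(c, W_L, b_L):
    """Softmax over class logits computed from the '[CLS]' vector c
    (step 9): (C, D) @ (D,) + (C,) -> (C,) class probabilities."""
    logits = W_L @ c + b_L
    e = np.exp(logits - logits.max())  # stabilised softmax
    return e / e.sum()

rng = np.random.default_rng(3)
D, C = 16, 2  # hidden size, number of classes (binary similarity)
probs = classify(rng.normal(size=D), rng.normal(size=(C, D)), np.zeros(C))
assert probs.shape == (C,) and np.isclose(probs.sum(), 1.0)
```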

Injected Embeddings
While many different embedding resources exist, here we focus on experimenting with pre-trained word representations that are beneficial for semantic similarity detection tasks and also contain information complementary to BERT. Embeddings such as word2vec and GloVe leverage co-occurrence patterns which have been shown to also be captured by BERT (Gan et al., 2020). Recent contextualised embeddings risk redundancy with BERT due to the similarity of the underlying approaches. We reason that linguistically-enriched embeddings are most likely to be complementary to BERT, as the model has not been explicitly trained on semantic or syntactic resources and has only partial knowledge of syntax and semantics (Rogers et al., 2020). We hence experiment with injecting dependency-based (Levy and Goldberg, 2014) and counter-fitted embeddings (Mrkšić et al., 2016) into BERT, which have been found useful for semantic similarity modelling and other related tasks (Filice et al., 2017; Feng et al., 2017; Alzantot et al., 2018; Jin et al., 2020).
The 300-dim dependency-based embeddings by Levy and Goldberg (2014) extend the SkipGram embedding algorithm proposed by Mikolov et al. (2013) by replacing linear bag-of-word contexts with dependency-based contexts which are extracted from parsed English Wikipedia sentences. As BERT has not been exposed to dependencies during pretraining and previous studies have found that BERT's knowledge of syntax is only partial (Rogers et al., 2020), we reason that these embeddings could provide complementary information.
The 300-dim counter-fitted embeddings by Mrkšić et al. (2016) integrate antonymy and synonymy relations into word embeddings based on an objective function which combines three principles: repelling antonyms, attracting synonyms and preserving the vector space. For training, they obtain synonym and antonym pairs from the Paraphrase Database and WordNet, demonstrating increased performance on SimLex-999 (Hill et al., 2015). We use their highest-scoring vectors, which were obtained by applying the counter-fitting method to Paragram vectors from Wieting et al. (2015). Antonym and synonym relations are particularly important for paraphrase detection, and injecting them into BERT gives the model access to this useful additional information.

Datasets and Tasks
We focus on semantic similarity detection which is a fundamental problem in NLP and involves modelling the semantic relationship between two sentences in a binary classification setup. We work with the following five widely used English language datasets which cover a range of sizes and tasks (including paraphrase detection, duplicate question identification and answer sentence selection, see Appendix A for details).
MSRP The Microsoft Research Paraphrase dataset (MSRP) contains 5K pairs of sentences from news websites which were collected based on heuristics and an SVM classifier. Gold labels are based on human binary annotations for sentential paraphrase detection (Dolan and Brockett, 2005).
SemEval The SemEval 2017 CQA dataset (Nakov et al., 2017) consists of three subtasks involving posts from the online forum Qatar Living. Each subtask provides an initial post as well as 10 posts which were retrieved by a search engine and annotated with binary labels by humans. The task requires distinguishing between relevant and non-relevant posts. The original problem is a ranking setting, but since the gold labels are binary, we focus on a classification setup. In subtask A, the posts are questions and comments from the same thread, in an answer sentence selection setup (26K instances). Subtask B is question paraphrase detection (4K instances). Subtask C is similar to A, but comments were retrieved from an external thread (47K instances). We use the 2016 test set as the dev set and the 2017 test set as the test set.
Quora The Quora duplicate questions dataset is the largest of the selected datasets, consisting of more than 400K question pairs with binary labels. The task is to predict whether two questions are duplicates. We use Wang et al. (2017)'s train/dev/test set partition.
All of the above datasets provide two short texts, each usually a single sentence but sometimes consisting of multiple sentences. For simplicity, we refer to each short text as 'sentence'. We frame the task as semantic similarity detection between two sentences through binary classification.

Metrics
Our main evaluation metric is the F1 score, as it is more meaningful than accuracy for datasets with imbalanced label distributions (such as SemEval C, see Appendix A). We also report performance on difficult cases using the non-obvious F1 score (Peinelt et al., 2019). This metric distinguishes obvious from non-obvious instances in a dataset based on lexical overlap and gold labels, and calculates a separate F1 score for challenging cases. This value therefore tends to be lower than the regular F1 score. Dodge et al. (2020) recently showed that early stopping and random seeds can have considerable impact on the performance of finetuned BERT models; we therefore finetune all models for 3 epochs with early stopping (based on dev F1) and report average model performance across two different seeds. Hyperparameter settings of all BERT-based models are identical, except for the learning rate and injection location, which are tuned with grid search (see Appendix D).
BERT Following standard practice, we encode the sentence pair with BERT's C vector from the final layer, followed by a softmax layer as proposed by Devlin et al. (2019). We use Tensorflow Hub's distribution of BERT BASE .
SemBERT Additionally, we compare with the semantics-aware BERT model (SemBERT, Zhang et al. 2020), which uses a semantic role labeler. As the original paper reports results on different dataset versions, we ran the official code on our datasets. Due to SemBERT's larger model size, the longer sentences in SemEval could not fit on a single GPU.
tBERT We also combine embeddings with BERT using an averaging and concatenation method proposed in tBERT (Peinelt et al., 2020). Instead of the word topics in the original system, we use pretrained counter-fitted and dependency embeddings for direct comparison with our methods.
AiBERT We further provide an alternative Attention-based embedding Injection method for BERT based on the multihead attention injection mechanism described in equations 3 to 4. Following the same procedure as GiBERT, we tune the injection location (see Appendix E).

Results
Full model GiBERT with counter-fitted embeddings outperforms all other systems in both average F1 and average non-obvious F1 score (see Table 1). This shows that the model improves on challenging dataset instances, rather than merely leveraging shallow surface patterns. It is worth noting that GiBERT has the fewest parameters of all BERT-enhancing models (see Appendix F) and does not require any additional preprocessing tools (such as the neural SRL tagger required by SemBERT), making it more efficient than SemBERT, tBERT and AiBERT. The largest improvement of GiBERT over BERT is observed with counter-fitted embeddings, especially on SemEval A and B (the datasets with the highest proportion of examples involving synonym pairs, see Table 5). GiBERT with dependency embeddings still improves over vanilla BERT, but gains tend to be smaller and roughly similar to the more complex AiBERT injection method. tBERT always combines external information with BERT at the latest possible stage, which is beneficial for dependency embeddings but less effective for counter-fitted embeddings (compare Table 3). Compared to GiBERT, this makes it a less flexible method while also requiring more parameters.
Our results indicate that semantic information is more important for the tasks at hand and syntactic information benefits from a late integration.
Gating mechanism Catastrophic forgetting is a potential problem when introducing external information into a pre-trained model, as the injected information could disturb or completely overwrite existing knowledge (Wang et al., 2020). In our proposed model, a gating mechanism is used to scale injected embeddings before adding them to the pre-trained internal BERT representation (see section 3.1). To understand the importance of this mechanism, in Table 2 we contrast development set performance for injecting information after the embedding layer with gating – as defined in equation 7 – and without – as in equation 6. For dependency embedding injection without gating, performance only improves on 2 out of 5 datasets over the baseline and in some cases even drops below BERT's performance, while it outperforms the baseline on all datasets when using the gating mechanism.
Counter-fitted embedding injection without gating improves on 4 out of 5 datasets, with further improvements when adding gating, outperforming the vanilla BERT model across all datasets. In addition, gating makes model training more stable and reduces failed runs (where the model predicted only the majority class) on the particularly imbalanced SemEval C dataset. This highlights the importance of the gating mechanism in our proposed method.
Injection location In our proposed model, information can be injected between any of BERT's pre-trained transformer blocks. We reason that different locations may be more appropriate for certain kinds of embeddings, as previous research has found that different types of information tend to be encoded and processed at specific BERT layers (Rogers et al., 2020). We experiment with three locations: after the embedding layer (using H_0), after the middle layer (using H_6 in BERT_BASE) and after the penultimate layer (using H_11 in BERT_BASE). Table 3 shows that mid-layer injection is ideal for counter-fitted embeddings, while late injection appears to work best for dependency embeddings. This is in line with previous work which found that BERT tends to process syntactic information at later layers than linear word-level information (Rogers et al., 2020). We consequently use these injection locations in our final model (see Appendix E for AiBERT's tuned injection locations).
Error Analysis Counter-fitted embeddings are designed to explicitly encode synonym and antonym relationships between words. To better understand how the injection of counter-fitted embeddings affects the ability of our model to deal with instances involving such semantic relations, we use synonym and antonym pairs from the PPDB and WordNet (provided by Mrkšić et al. 2016) and search the development partition of the datasets for sentence pairs where the first sentence contains one word of the synonym/antonym pair and the second sentence the other word. Table 5 reports the F1 performance of our model on cases with synonym pairs, antonym pairs and neither. We find that our model's F1 performance particularly improves over BERT on instances containing synonym pairs, as illustrated in example (1) in Table 4. By contrast, the performance on cases with antonym pairs stays roughly the same, although slightly decreasing on Quora. As illustrated by example (2) in Table 4, word pairs can be antonyms in isolation (e.g. husband – wife), but not in the specific context of a given example. In rare cases, the injection of distant antonym pair embeddings can therefore deter the model from detecting related sentence pairs. We also observe a slight performance boost for cases without synonym or antonym pairs, which could be due to improved representations for words which occurred in examples without their synonym or antonym counterpart.
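The search procedure above can be sketched as a simple token match. This is our own simplified illustration (whitespace tokenisation, exact word match); the sentences and word pairs below are invented examples, not items from the datasets.

```python
def find_pair_instances(sentence_pairs, word_pairs):
    """Return indices of sentence pairs where one sentence contains one
    word of a synonym/antonym pair and the other sentence contains the
    other word (a simplified sketch of the paper's search)."""
    hits = []
    for idx, (s1, s2) in enumerate(sentence_pairs):
        t1, t2 = set(s1.lower().split()), set(s2.lower().split())
        for a, b in word_pairs:
            if (a in t1 and b in t2) or (b in t1 and a in t2):
                hits.append(idx)
                break
    return hits

pairs = [("the reply was prompt", "the answer came quickly"),
         ("her husband arrived", "his wife arrived")]
synonyms = [("reply", "answer"), ("prompt", "quickly")]
antonyms = [("husband", "wife")]
assert find_pair_instances(pairs, synonyms) == [0]
assert find_pair_instances(pairs, antonyms) == [1]
```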

Conclusion
In this paper, we introduced a new approach for injecting external information into BERT. Our proposed method adds linguistically enriched embeddings to BERT's internal representation through a lightweight gating mechanism which requires fewer parameters than previous approaches. Evaluating our injection method on multiple semantic similarity detection datasets, we demonstrated that injecting counter-fitted embeddings clearly improved performance over vanilla BERT and on average outperformed all baselines on the task, while dependency embedding injection achieved slightly smaller gains. In comparison to the multihead attention injection mechanism, we found the gated method at least as effective, with comparable performance for dependency embeddings and improved results for counter-fitted embeddings. In ablation studies, we showed that the choice of injection location and the use of the proposed gating mechanism are crucial for our architecture. Our qualitative analysis highlighted that counter-fitted injection was particularly helpful for instances involving synonym pairs. Future work could explore combining multiple embedding sources or injecting other types of information. Another direction is to investigate the usefulness of embedding injection for other tasks.

C Preprocessing
We lowercase and tokenise all datasets, replacing images and URLs with placeholders. Sequences exceeding the maximum length are cut off. Each model is trained on a single NVIDIA Tesla K80 GPU.

E Injection Location for AiBERT
Based on the development set results, we tune AiBERT's injection location separately for each embedding type and dataset.

G Required Injection Parameters
This section compares the number of required parameters in the two alternative injection methods discussed in section 4.1: a multihead attention injection mechanism (AiBERT) and a novel lightweight gated injection mechanism (GiBERT).

Attention injection
In multihead attention injection (equations 3 to 4), the queries are provided by BERT's representation from the injection layer H_i, while the keys and values come from the injected embeddings I. Multihead attention requires the following weight matrices W and biases b to transform queries, keys and values (indicated by Q, K and V) and to transform the attention output (indicated by O):

params(AttentionInjection) = params(W^Q, b^Q, W^K, b^K, W^V, b^V, W^O, b^O)
                           = (D·D + D) + 2(E·D + D) + (D·D + D)
                           = 2D^2 + 2ED + 4D

where D indicates BERT's hidden dimension and E indicates the dimensionality of the injected embeddings. When injecting embeddings with E = 300 into BERT_BASE with D = 768, this amounts to ≈ 1.64M new parameters.

Gated injection
The proposed gated injection method (equations 6 to 7) only introduces the weights and bias from the projection layer, as well as the gating vector g ∈ R^D:

params(GatedInjection) = params(W_P, b_P, g) = D·E + D + D = D(E + 2).
Gated injection of embeddings with E = 300 into BERT BASE requires ≈ 0.23M new parameters.
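The two counts and the 14% ratio can be verified with a few lines of arithmetic (assuming, as in section 3.1, that the gating vector g has dimensionality D):

```python
D, E = 768, 300  # BERT-BASE hidden size, injected embedding size

# Multihead attention injection: query and output projections are D x D,
# key and value projections map from E to D, plus four D-dim biases.
attn = 2 * D * D + 2 * E * D + 4 * D

# Gated injection: projection W_P (D x E), bias b_P (D) and gate g (D).
gated = D * E + D + D

assert attn == 1_643_520            # ~1.64M parameters
assert gated == 231_936             # ~0.23M parameters
assert round(gated / attn, 2) == 0.14
```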
Therefore, the proposed gated injection mechanism only requires 14% of the parameters used in a multihead attention injection mechanism. Using fewer parameters results in a smaller model which is especially beneficial for injecting information during finetuning, where small learning rates and few epochs make it difficult to learn large amounts of new parameters.

H Gating Parameter Analysis
As described in section 4.1, the gating parameters g in our proposed model are initialised as a vector of zeros. During training, the model can learn to gradually inject external information by adjusting gating parameters to > 0 for adding, or < 0 for subtracting, injected information along certain dimensions. Alternatively, injection stays turned off if all parameters remain at zero. Figure 2 shows a histogram of learned gating vectors for our best GiBERT models with counter-fitted (left) and dependency embedding injection (right). On most datasets, the majority of parameters have been updated to small non-zero values, letting through controlled amounts of injected information without completely overwriting BERT's internal representation. Only on SemEval B (with 4K instances the smallest of the datasets, compare section 3) do more than 500 of the 768 dimensions of the injected information stay blocked out for both model variants. The gating parameters also filter out many dimensions of the dependency-based embeddings on MSRP (the second smallest dataset). This suggests that models trained on smaller datasets may benefit from slightly longer finetuning or a different gating parameter initialisation to make full use of the injected information.