Better Neural Machine Translation by Extracting Linguistic Information from BERT

Work on adding linguistic information (syntax or semantics) to neural machine translation (NMT) has mostly focused on using point estimates from pre-trained models. Directly using the capacity of massive pre-trained contextual word embedding models such as BERT (Devlin et al., 2019) has been only marginally useful in NMT because effective fine-tuning is difficult to obtain for NMT without making training brittle and unreliable. We augment NMT by extracting dense, fine-tuned vector-based linguistic information from BERT instead of using point estimates. Experimental results show that our method of incorporating linguistic information helps NMT generalize better in a variety of training contexts and is no more difficult to train than conventional Transformer-based NMT.


Introduction
Probing studies into large contextual word embeddings such as BERT (Devlin et al., 2019) have shown that these deep multi-layer models essentially reconstruct the traditional NLP pipeline capturing syntax and semantics (Jawahar et al., 2019); information such as part-of-speech tags, constituents, dependencies, semantic roles, coreference resolution information (Tenney et al., 2019a,b) and subject-verb agreement information can be reconstructed from BERT embeddings (Goldberg, 2019). In this work, we wish to extract the relevant pieces of linguistic information related to various levels of syntax from BERT in the form of dense vectors and then use these vectors as linguistic "experts" that neural machine translation (NMT) models can consult during translation.
But can syntax help improve NMT? Linzen et al. (2016), Kuncoro et al. (2018), and Sundararaman et al. (2019) report that learning the grammatical structure of sentences can lead to higher performance in NLP models. In particular, Sennrich and Haddow (2016) show that augmenting NMT models with explicit linguistic annotations improves translation quality.
BERT embeddings have been previously considered for improving NMT models. Clinchant et al. (2019) replace the encoder token embedding layer in a Transformer NMT model with BERT contextual embeddings. They also experiment with initializing all the encoder layers of the translation model with BERT parameters, in which case they report results both when freezing and when fine-tuning the encoder parameters during training. In their experiments, BERT embeddings help with noisy inputs to the NMT model, but otherwise do not improve NMT performance. Imamura and Sumita (2019) suggest that replacing the encoder layer with BERT embeddings and fine-tuning BERT while training the decoder leads to catastrophic forgetting, where useful information in BERT is lost due to the magnitude and number of updates necessary for training the translation decoder and fine-tuning BERT. They present a two-step optimization regime in which the first step freezes the BERT parameters and trains only the decoder, while the next step fine-tunes the encoder (BERT) and the decoder at the same time. Yang et al. (2020) also try to address catastrophic forgetting by treating BERT as a teacher for the encoder of the neural translation model (the student network) (Hinton et al., 2015). They propose a dynamic switching gate implemented as a linear combination of the encoded embeddings from BERT and the encoder of the NMT model. However, these papers do not really focus on the linguistic information in BERT, but rather try to combine pre-trained BERT and NMT encoder representations. Sundararaman et al. (2019) identify part-of-speech, case, and sub-word position as essential linguistic information for improving the quality of both BERT and the neural translation model. They extract each linguistic feature using the Viterbi output of a separate model, embed the extracted linguistic information (similar to trained word embeddings) and append these vectors to the token embeddings. However, their model uses point estimates from the syntactic models and does not use the linguistic information in BERT embeddings. Weng et al. (2019) use multiple multi-layer perceptron (MLP) modules to combine the information from different layers of BERT into the translation model. To make the most of the fused information, they also alter the translation model training objective to contain auxiliary knowledge distillation (Hinton et al., 2015) terms concerned with the information coming from the pre-trained language model. Zhu et al. (2020) also inject BERT into all layers of the translation model rather than only the input embeddings. Their model uses an attention module to dynamically control how each layer interacts with the representations. In both of these works, training the Transformer for NMT becomes quite brittle and is prone to diverging to bad local optima.
In this paper, we propose using pre-trained BERT as a source of linguistic information rather than as a source of frozen pre-trained contextual embeddings. We identify components of the BERT embeddings that correspond to different types of linguistic information such as part-of-speech, fine-tune dense vector embeddings for these linguistic aspects of the input, and use them within an NMT model. Our approach does not radically complicate Transformer NMT training, either in terms of time and hardware requirements or in terms of training difficulty (it avoids bad local optima).
Our contributions are as follows: (1) A method of linguistic information extraction from BERT which needs supervision while training but works without supervision afterwards. (2) An easily trainable procedure for integrating the extracted information into the translation model. (3) Evaluation of the proposed model on small, medium and large translation datasets.
The source code and trained aspect extractors are available at https://github.com/sfu-natlang/SFUTranslate, and our experiments can be replicated using the scripts under resources/exp-scripts/aspect_exps.

NMT and BERT
Machine translation is the problem of transforming an input utterance sequence X in a source language l_f into another utterance sequence Y (possibly of a different length) in a target language l_e. Machine translation models search over all possible sequences in the target language for the most probable sequence under the distribution in Equation 1.

$$\hat{Y} = \underset{Y}{\arg\max}\; p(Y \mid X) \qquad (1)$$

Neural machine translation (NMT) models the probability distribution p(Y|X) with neural networks, taking advantage of deep learning techniques. Transformers (Vaswani et al., 2017) are one type of encoder-decoder neural network used for translation. In a Transformer, the input (in one-hot format) is passed through N encoder layers and N decoder layers. In each layer, the layer input passes through multiple attention heads (h heads, each of which can be considered a specialist in a different sentence-level linguistic attribute) and is then transformed into the input of the next layer by a two-layer feed-forward module with input size d_model and hidden layer size d_ff. The final probability distribution p(y|X) is generated by an affine transformation applied to the output of the last feed-forward module in the N-th decoder layer. Please see Vaswani et al. (2017) for further details.
BERT (Devlin et al., 2019) adopts the encoder part of the Transformer model and trains it on large amounts of text using a masked language model objective over sub-words, $p(y_i \mid X, y_0, \dots, y_{i-1}, y_{i+1}, \dots, y_{\mathit{max\_len}})$, instead of predicting the next sub-word, $p(y_i \mid X, y_0, \dots, y_{i-1})$. This bidirectional context turns BERT into a provider of strong contextual sub-word embeddings in many languages. These massively over-parameterized neural networks have revolutionized many NLP tasks, including language modeling, named entity recognition, question answering, natural language inference, text classification (Devlin et al., 2019), and question generation (Chan and Fan, 2019). Effective application of BERT in NMT has been studied in a number of contemporary research projects; we approach this problem from the novel perspective of extracting the linguistic information encoded in BERT and applying that information in NMT.

Linguistic Aspect Extraction from BERT
Since BERT contextual embeddings contain a variety of information (linguistic and non-linguistic), extracting the relevant information plays an important role in further improving downstream tasks. In the rest of this section, we define aspect vectors as single-purpose dense vectors of linguistic information extracted from BERT, discuss how aspect vectors can be extracted, and explain how to integrate aspect vectors into NMT.

Aspect Vectors
To start the information extraction process, we first choose a limited (desired) set of linguistic attributes to look for in the BERT embeddings. This attribute set can contain a number of linguistic aspects (e.g. part-of-speech). Each linguistic aspect is defined over a possible aspect tag set (e.g. the set {NOUN, ADJ, ...} for part-of-speech). In this paper, we denote a linguistic attribute set by A, a generic aspect by a, and its associated tag set by t_a. Given the definition of a linguistic aspect, and inspired by the information bottleneck idea (Tishby and Zaslavsky, 2015), we define an aspect vector as a single-purpose dense vector extracted from BERT that contains information about a certain linguistic aspect of a particular (sub-word) token in the input sequence. Aspect vectors can be interpreted as feature values corresponding to a specific key (aspect).

Aspect Vector Extraction
For each embedding vector E and linguistic aspect a, we define M_a as an aspect-extraction function, where e_a = M_a(E) is a single-purpose dense vector containing maximum information about the aspect and minimum irrelevant other information.
We ensure the aspect encoding power of e_a by retrieving its equivalent tag in t_a using a classifier. The aspect prediction loss for a linguistic attribute set A of size n is calculated as the average cross-entropy loss (L_CE) between the classifier predictions and the expected aspect tags for each aspect (Equation 2). We also ensure the information integrity of e_a by concatenating all the aspect vectors (together with a "left-over" aspect holding all the remaining, non-interesting information) and reconstructing the original embedding vector E from them in a reconstruction vector R. The reconstruction loss (L_r) for the extracted aspect vectors is calculated as the Euclidean distance between the reconstruction vector R and the original embedding vector E (Equation 3).
In addition, since our aspect extractor is similar in architecture to a multi-head attention module (with the difference that we know exactly what each head is responsible for), we add the average Euclidean similarity (L_s) of each pair of aspect vectors to the training loss function (Equation 4) to prevent learning redundant representations (Michel et al., 2019).
The aspect extractor is trained on the sum of the three loss components above (Equation 5). Figure 1 shows the different parts of the aspect extractor and their connections. Another important point is that a pre-trained BERT model has multiple encoder layers as well as an embedding layer. Choosing a single layer that contains all of our desired aspects is not straightforward, since different layers specialize in different linguistic aspects (Jawahar et al., 2019; Tenney et al., 2019a).
Therefore, following Peters et al. (2018), we define the BERT embedding vector E as a weighted sum over all BERT layers (the embedding layer plus the encoder layers) as in Equation 6, $E = \sum_{l} \alpha_l E^{(l)}$, where $E^{(l)}$ is the output of layer l and the α weights are learnable parameters trained along with the other aspect extractor parameters.
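The PyTorch sketch below puts the pieces of this section together under stated assumptions: the module and parameter names (AspectExtractor, mid_dim, tagset_sizes, etc.) are ours, the softmax normalization of the layer weights follows Peters et al. (2018) rather than being confirmed above, and the exact form of the pairwise similarity penalty L_s is an assumption (here, the negative average pairwise Euclidean distance).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectExtractor(nn.Module):
    """Sketch of the aspect extractor: a weighted sum of BERT layers is mapped
    through a shared middle layer, split into per-aspect vectors plus one
    "left-over" vector, and trained with classification, reconstruction, and
    dissimilarity losses. Sizes follow the paper's setup (mid_dim=1000, aspect
    vectors of size 200); the tag-set sizes here are illustrative."""

    def __init__(self, bert_dim=768, num_layers=13, mid_dim=1000, aspect_dim=200,
                 tagset_sizes=(16, 55, 30, 3)):   # CPOS, FPOS, WSH (placeholder size), SWP
        super().__init__()
        self.num_aspects = len(tagset_sizes)
        assert mid_dim == (self.num_aspects + 1) * aspect_dim   # |A| aspects + left-over
        self.aspect_dim = aspect_dim
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # alphas of Equation 6
        self.encode = nn.Linear(bert_dim, mid_dim)                  # shared middle layer
        self.classifiers = nn.ModuleList([nn.Linear(aspect_dim, n) for n in tagset_sizes])
        self.reconstruct = nn.Linear(mid_dim, bert_dim)             # rebuilds E from the aspects

    def forward(self, all_layer_states):
        # all_layer_states: (num_layers, batch, seq, bert_dim) from a frozen BERT
        w = torch.softmax(self.layer_weights, dim=0)
        E = (w.view(-1, 1, 1, 1) * all_layer_states).sum(dim=0)     # Equation 6
        hidden = torch.relu(self.encode(E))
        aspects = torch.split(hidden, self.aspect_dim, dim=-1)      # |A| aspect vectors + left-over
        logits = [clf(asp) for clf, asp in zip(self.classifiers, aspects)]
        R = self.reconstruct(hidden)                                # reconstruction vector
        return aspects, logits, R, E

def aspect_extractor_loss(aspects, logits, R, E, gold_tags):
    """gold_tags: one (batch, seq) LongTensor per aspect (padding handling omitted)."""
    # L_a: average cross-entropy over the aspect classifiers (Equation 2)
    l_a = torch.stack([F.cross_entropy(lg.flatten(0, 1), g.flatten())
                       for lg, g in zip(logits, gold_tags)]).mean()
    # L_r: Euclidean distance between reconstruction and original embedding (Equation 3)
    l_r = torch.norm(R - E, dim=-1).mean()
    # L_s: dissimilarity term over pairs of aspect vectors (Equation 4); exact form assumed
    pairs = [(i, j) for i in range(len(aspects)) for j in range(i + 1, len(aspects))]
    l_s = -torch.stack([torch.norm(aspects[i] - aspects[j], dim=-1).mean()
                        for i, j in pairs]).mean()
    return l_a + l_r + l_s                                          # Equation 5
```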

Integrating Aspect Vectors into NMT
Once the aspect vectors have been trained, we throw away the classifiers and the reconstruction layer and place the encoder part of our trained aspect extractor (the mapping from BERT contextual embeddings to aspect vectors) inside an input integration module designed to augment the input of the neural translation model with aspect vectors.
The integration module (a two-layer perceptron network) receives the concatenated aspect vectors (we call this concatenated vector the linguistic embedding) and the token embedding (inherited from the Transformer model). It first maps the linguistic embedding to a vector of the same size as the token embedding, and then projects the concatenation of both embeddings to a vector with the same size as the token embedding of the original Transformer model. Figure 2 illustrates this process.
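A minimal sketch of this integration module, assuming the module name, the ReLU activation, and the exact layer shapes (none of which are specified above):

```python
import torch
import torch.nn as nn

class AspectIntegration(nn.Module):
    """Sketch of the input integration module: the concatenated aspect vectors
    (the linguistic embedding) are mapped to the token-embedding size, then the
    concatenation of both embeddings is projected back to d_model, which
    replaces the token embedding fed to the Transformer encoder."""

    def __init__(self, linguistic_dim, d_model):
        super().__init__()
        self.ling_to_model = nn.Sequential(nn.Linear(linguistic_dim, d_model), nn.ReLU())
        self.project = nn.Linear(2 * d_model, d_model)

    def forward(self, aspect_vectors, token_embedding):
        # aspect_vectors: list of (batch, seq, aspect_dim) tensors produced by the
        # frozen aspect-extractor encoder; token_embedding: (batch, seq, d_model)
        linguistic_embedding = torch.cat(list(aspect_vectors), dim=-1)
        mapped = self.ling_to_model(linguistic_embedding)
        return self.project(torch.cat([mapped, token_embedding], dim=-1))
```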

Experiments
In this section, we initially examine our designed aspect extractor and report its classification accuracy scores. Next, we integrate the extracted aspect vectors into the neural machine translation framework as explained in Section 3.3 and study the effects of integrated vectors on the performance of the models.

Data
We choose three German-to-English datasets of different sizes to examine our proposed framework; German is a suitable source language since it has explicit and nuanced linguistic features.
We use Multi30k (M30k) 6 as our small dataset. This dataset contains a multilingual set of image descriptions in German, English and French. For this reason, we also experiment on German to French as a second small dataset. The M30k data contains 29K training sentences, 1014 validation sentences (val) and 1000 test sentences (test2016).
We take IWSLT (Cettolo et al., 2012) 7 as our medium-sized dataset. The sentences in this dataset are quite different from M30k since they come from transcriptions of TED talks as well as dialogues and lectures 8. The IWSLT data contains 208K training sentences, 888 validation sentences (dev2010) and multiple test sets (tst2010 to tst2015 with 1568, 1433, 1700, 993, 1305, and 1080 sentences, respectively).
For the large data size, we consider WMT 9, a large (4.5M training sentences) set of parallel sentences from the proceedings of the European Parliament as well as web-crawled news articles. We remove 0.05% of the training data (2290 sentences; lines whose line numbers are divisible by 2000) and use it as the validation set (we call it wmt val), and take the newstest data from 2014 to 2019 as our test sets (with 3003, 2169, 2999, 3004, 2998 and 1997 sentences, respectively).
We remove training sentences longer than 100 words, and lowercase and punctuation-normalize the sentences on both sides using MosesPunctNormalizer 10 before tokenization. The reference side of the test data remains untouched in all steps of our experiments.
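A small sketch of this preprocessing step using the sacremoses MosesPunctNormalizer; the helper name and the exact filtering rule (e.g. whether the 100-word limit applies to one side or either side) are our assumptions:

```python
from sacremoses import MosesPunctNormalizer

norm_de = MosesPunctNormalizer(lang="de")
norm_en = MosesPunctNormalizer(lang="en")

def preprocess_pair(src_line, tgt_line, is_train=True, max_len=100):
    """Lowercase and punctuation-normalize both sides; drop over-long training pairs."""
    if is_train and max(len(src_line.split()), len(tgt_line.split())) > max_len:
        return None                                   # filter long training sentences
    return norm_de.normalize(src_line.lower()), norm_en.normalize(tgt_line.lower())
```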

Linguistic Aspect Vector Extraction
In this section, we study our linguistic aspect extractor training procedure and analyze the quality of the extracted aspect vectors.

Footnotes:
6 Also known as Flickr30K, provided in task 1 of the WMT17 multimodal machine translation shared task, http://www.statmt.org/wmt17/multimodal-task.html
7 2017 was the last year the data for this task was updated; https://wit3.fbk.eu/mt.php?release=2017-01-mted-test
8 While the talks are quite polished, they still contain many verbal structures and sometimes even sounds (e.g. "Imagine an engine going clack, clack, clack, clack, clack, clack, clack.").
9 Europarl + CommonCrawl + NewsCommentary, https://www.statmt.org/wmt14/translation-task.html; note that in later years this training set remained the same, but ParaCrawl data was added to it. We do not use ParaCrawl since it is quite noisy and we aim to limit the effects of uncontrolled variables in our training data. However, we report our results on all test sets after 2014.
10 https://github.com/alvations/sacremoses/

We choose our linguistic attribute set (A) as Sundararaman et al. (2019) suggest; however, we replace 'case' with 'word-shape' 11 since we believe the complete shape of the word is much more informative, especially in sub-word settings. In addition, we consider a two-level hierarchy of part-of-speech tags to benefit both from higher accuracy in exploring the syntactic search space and from lower model confusion in cases where the fine-grained tags are not helpful. Therefore, we consider coarse-grained and fine-grained part-of-speech (CPOS and FPOS), word-shape (WSH), and sub-word position 12 (SWP) to form our experimental linguistic attribute set (A). Other linguistic attributes such as dependency parses or sentiment could be considered as aspects in our model, but we leave that for future work.
We use the spaCy German tagger 13 to acquire our intended linguistic aspect labels. Since spaCy operates at the word level while BERT operates at the sub-word level, we align the two sequences using a monotonic alignment algorithm (see Appendix A.1.1). The fine-grained part-of-speech tagger in spaCy 14 is pre-trained on the TIGER Corpus 15 (Smith et al., 2003) and inherits its 55 fine-grained tags from the TIGER treebank. The coarse-grained spaCy part-of-speech tagger is trained using a direct mapping from the 55 TIGER treebank tags to the 16 tags of the Universal Dependencies v2 POS tag set 16.
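The sketch below illustrates how word-level spaCy labels can be projected onto BERT sub-words. The model names are publicly available ones and may differ from the exact models used here, and the projection simply re-tokenizes each spaCy word with the WordPiece tokenizer instead of running the monotonic alignment algorithm of Appendix A.1.1:

```python
import spacy
from transformers import BertTokenizerFast

nlp = spacy.load("de_core_news_md")
tokenizer = BertTokenizerFast.from_pretrained("dbmdz/bert-base-german-uncased")

def tag_subwords(sentence):
    """Return sub-word tokens with CPOS, FPOS, word-shape and sub-word-position labels."""
    doc = nlp(sentence)
    sub_tokens, cpos, fpos, wsh, swp = [], [], [], [], []
    for word in doc:
        pieces = tokenizer.tokenize(word.text)
        for i, piece in enumerate(pieces):
            sub_tokens.append(piece)
            cpos.append(word.pos_)      # coarse-grained POS (UD v2 tags)
            fpos.append(word.tag_)      # fine-grained POS (TIGER tags)
            wsh.append(word.shape_)     # word-shape feature
            # sub-word position: Single / Begin / Inside
            swp.append("Single" if len(pieces) == 1 else ("Begin" if i == 0 else "Inside"))
    return sub_tokens, cpos, fpos, wsh, swp
```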
We use a 12-layer 17 German pre-trained BERT model for encoding the source sentences in the aspect extractors. We use an uncased model since our translation model operates on lowercased data; the outputs are recased with the Moses recaser so that the reported results are cased BLEU scores comparable to other systems 18. We pass the BERT-encoded source sentences through a single perceptron middle layer of size 1000 and divide the output of this layer into 'number of aspects + 1' splits to form our desired aspect vectors (of size 200). Please see Appendix A.1.1 for more implementation details. We train three different aspect extractors, one per dataset, feeding the source sentences of the dataset to our model in batches of size 32 for 3 epochs 19. Table 1 shows F-1 scores for classifying the validation set using the different aspect vectors after training the aspect extractors on the training sentences. Note that for the word-level scores, in cases of disagreement between sub-word tokens, the prediction for the first sub-word token is counted as the prediction for the word label.

11 Representing capitalization (changing letters to x or X), punctuation, and digits (changing digits to d). As an example of word-shape, the sub-word ##arxiv. in the token 'myarxiv.org' becomes ##xxxxx..
12 Encoding the word with one of the three labels "Begin", "Inside", or "Single".
13 https://spacy.io/models/de
14 spaCy reports 96.52% accuracy for this model.
15 https://www.ims.uni-stuttgart.de/
16 https://universaldependencies.org/v2/postags.html
17 Hidden state size of 768 with 12 heads; written in PyTorch and distributed by Wolf et al. (2019). Model configurations can be found at https://github.com/dbmdz/berts.
18 We recommend using a cased BERT model for translation systems that handle casing differently.
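As a small illustration of the word-level scoring rule above (the helper name is ours, not the paper's), disagreements between sub-words of the same word are resolved by keeping the first sub-word's prediction:

```python
def word_level_predictions(subword_predictions, word_ids):
    """Aggregation rule for word-level scores: when sub-words of the same word
    disagree, the prediction of the first sub-word is taken as the word-level
    prediction. word_ids maps each sub-word to the index of its source word."""
    word_preds = {}
    for pred, wid in zip(subword_predictions, word_ids):
        if wid not in word_preds:          # keep only the first sub-word's prediction
            word_preds[wid] = pred
    return [word_preds[w] for w in sorted(word_preds)]
```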
We also validate our trained (on M30k, IWSLT, and WMT) aspect extractors against the manual annotations of the TIGER treebank, on which the spaCy fine-grained part-of-speech tagger was trained. We train an extra aspect extractor using the TIGER training set and test all four trained aspect extractors against the TIGER test set 20. This experiment evaluates the absolute power of our simple feed-forward aspect extractors in performing the aspect classification task. Note that our goal in this experiment is not to achieve state-of-the-art fine-grained part-of-speech tagging results, since our aspect extractors receive their input from BERT and do not directly access the tagged input sentences. Table 2 contains the results of comparing the predictions of the different aspect extractor classifiers against the TIGER gold labels.

19 Since the number of WMT sentences is much larger, we stop training the WMT aspect extractor when there is no improvement in the aspect classification result (rounded to 3 decimal places) for any label for at least 40 batches.
20 We use german_tiger_test_gs.conll from the version of TIGER released for the 2006 CoNLL Shared Task.

Uniqueness of Information in Linguistic Aspect Vectors
Considering the high F-scores for each aspect category in each dataset (Table 1), we can conclude that our aspect extractor maximizes the extraction of relevant information from the BERT embeddings. The loss in Equation 4 maximizes the distance between aspect vectors. To test whether this leads to a diverse set of aspect vectors, each specialized for its own linguistic attribute, we consider each aspect category a after training the aspect extractors. We take each of the other extracted aspect vectors a′ (except the "left-over" vector) and use each of them to train a new classifier that predicts the right class for category a based on aspect vector a′. This tests the correlation between the information in aspect vectors a′ and the tags in category a. If the classification scores for this counterfactual test were high, our model would have failed to fine-tune each aspect vector to predict a particular linguistic aspect. We compare the classification scores to a trivial baseline: always predict the most frequent class. Table 3 shows the results of this counterfactual test for the aspect extractor trained on TIGER data. The average F-1 scores are very low when we use counterfactual aspect vectors to predict a linguistic aspect on which they were not fine-tuned (e.g. using the aspect vector trained on part-of-speech to predict word shape). This shows that our training method fine-tunes each aspect vector to its linguistic task.

Table 3: Classification scores of each aspect classifier when fed with other extracted aspect vectors. We expect the F-1 scores to be low, so we can conclude that our aspect extractor truly excludes irrelevant information from each aspect.
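A sketch of this counterfactual probe; the function name and the training hyperparameters are placeholders, since the probe's training details are not specified above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def counterfactual_probe(mismatched_aspect_vectors, gold_tags, num_tags, epochs=5, lr=1e-3):
    """Train a fresh linear classifier that tries to predict the tags of category a
    from aspect vectors fine-tuned for a different category a'. Low F-1 for this
    probe indicates that the aspect vectors are specialized.
    Inputs: (N, aspect_dim) float tensor and (N,) long tensor of gold tags."""
    probe = nn.Linear(mismatched_aspect_vectors.size(-1), num_tags)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.cross_entropy(probe(mismatched_aspect_vectors), gold_tags)
        loss.backward()
        optimizer.step()
    return probe   # evaluate its F-1 on held-out data against the majority-class baseline
```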
To validate the loss in Equation 3, we calculate the average Euclidean distance between the aspect extractor's reconstructed vectors and the original BERT embedding vectors for the M30k German to English dataset. We unit-normalize each of the vectors to obtain a score in [0, 1]. The average Euclidean distance of 0.1863 indicates that the reconstruction component of the aspect extractor is capable of reconstructing vectors that are close to the original embedding vectors.
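A corresponding sketch of this check (the function name is ours):

```python
import torch
import torch.nn.functional as F

def average_normalized_distance(reconstructed, original):
    """Unit-normalize the reconstructed and original BERT vectors and return the
    average Euclidean distance between them (0 means perfect reconstruction)."""
    r = F.normalize(reconstructed, dim=-1)
    e = F.normalize(original, dim=-1)
    return torch.norm(r - e, dim=-1).mean().item()
```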

Linguistic Aspect Integrated Machine Translation
After confirming the adequacy and uniqueness of the linguistic information in the aspect vectors, we integrate the encoder part of the aspect extractors into the translation model and perform translation experiments on the M30k, IWSLT, and WMT datasets. In our experiments, we compare our model to three baselines: (1) the vanilla Transformer model (Vaswani et al., 2017), (2) the Transformer with frozen BERT input embeddings (bert-freeze; Clinchant et al., 2019), and (3) the syntax-infused Transformer (Sundararaman et al., 2019). During each training trial, we perform 9 validation set evaluation steps (one after visiting each 10% of the data). In each step, the validation set is translated with the current state of the model and the generated sentences are detokenized and compared to the validation set references to produce sentence-level BLEU (Lin and Och, 2004) scores. The best-scoring model throughout training is selected as the model with which the test set(s) are translated.
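A sketch of this checkpoint-selection loop; train_step and evaluate_bleu are placeholders for one optimizer update and for translating, detokenizing, and scoring the validation set, respectively:

```python
import copy

def train_with_checkpoint_selection(model, train_batches, val_set,
                                    train_step, evaluate_bleu, num_evals=9):
    """Evaluate on the validation set after roughly every 10% of the training
    data and keep the best-scoring model state for test-set translation."""
    best_score, best_state = float("-inf"), None
    eval_every = max(len(train_batches) // (num_evals + 1), 1)
    for i, batch in enumerate(train_batches, start=1):
        train_step(model, batch)
        if i % eval_every == 0:                      # one evaluation per ~10% of the data
            score = evaluate_bleu(model, val_set)
            if score > best_score:
                best_score = score
                best_state = copy.deepcopy(model.state_dict())
    return best_state, best_score
```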
For the M30k and IWSLT datasets, we train two separate models: one using the aspect vectors trained on the source side of the dataset's own training data (in-domain) and the other using the aspect vectors trained on the source side of the WMT data (out-of-domain). We use cased BLEU (evaluated with the standard mteval-v14.pl script) and METEOR (Denkowski and Lavie, 2014) to compare the different models. Tables 4 and 7 show the results of evaluating the models trained with the different settings mentioned above.
The evaluation results show that taking advantage of aspect vectors improves the accuracy of translating German to both English and French on M30k, as well as German to English on IWSLT and WMT. In the majority of cases, the WMT-trained aspect vectors push the model to produce more accurate results since they contain more general information. Based on these results, we conjecture that aspect vectors trained on large out-of-domain data can be helpful in low-resource settings, but we leave the examination of this idea for future work.
Aside from performance, our model is approximately 5 times faster than the syntax-infused translation model (Sundararaman et al., 2019) while requiring fewer trainable parameters. Although it is not as fast as the bert-freeze model (Clinchant et al., 2019) in large-scale settings (because of the computation required to calculate the linguistic embedding), it is comparable in speed to bert-freeze in medium- and small-scale settings. Appendix A.2 contains additional insights into how aspect vectors help translation systems trained on different dataset sizes.
Tables 5 and 6 show examples of cases where aspect vectors have been useful in improving translation quality.

Notes for Table 4: the size of BERT has not been added to the model size for the aspect-augmented and bert-freeze models since BERT is not trained in these settings. runtime is the total time the training script ran, including the time taken for reading the data and training the model from scratch (iterating over the instances for all epochs). All baseline results were obtained using our re-implementation of the papers mentioned. * We used a single GeForce GTX 1080 GPU for the M30k experiments and a single Titan RTX GPU for the IWSLT and WMT experiments. † Each experiment was repeated three times, and we report the average.

Conclusion and Future Work
In this paper, we proposed a simple method for extracting linguistic information from BERT contextual embeddings and integrating it into the neural machine translation framework. We showed that the linguistic aspect vectors provide the translation models with out-of-domain knowledge which not only improves translation quality but also helps the model deal better with out-of-vocabulary words. In the future, we would like to recast the integration module as a multi-head attention module, except that it would attend to different linguistic aspects of the current sub-word or of the sub-word tokens of a single word. Increasing the number of linguistic aspects (especially syntactic dependencies and morphology) and studying the effects of the aspect vector size on the quality of generated translations are other directions for future research. We would also like to examine the effectiveness of aspect vectors trained on large out-of-domain data in low-resource settings and explore the effects of using linguistic aspect vectors in tasks other than machine translation.
Table 5: Translation example.
Source: Ihm werde weiterhin vorgeworfen, unerlaubt geheime Informationen weitergegeben zu haben.
Reference: He is still accused of passing on secret information without authorisation.
Vaswani et al. (2017): He has also been accused of having illegally passed on secret information.
Clinchant et al. (2019): He continues to be accused of fraudulently passing on secret information.
Sundararaman et al. (2019): He is also accused of having pass unauthorised secret information on.
Aspect Augmented NMT: He is still accused of passing on illegal secret information.
Table 6: Translation example.
Source: Auto und Traktor krachen zusammen: Frau stirbt bei schrecklichem Unfall
Reference: Car and tractor crash together: woman dies in terrible accident
Vaswani et al. (2017): Car and traktor cranes together: women die in the event of a terrible accident.
Clinchant et al. (2019): Cars and tractors are killing women in the event of a terrible accident.
Sundararaman et al. (2019): Auto and tractor are blowing together: woman dies when the terrible accident occurs.
Aspect Augmented NMT: Car and tractor crash together: woman dies in terrible accidents.

A.1.1 Linguistic Aspect Vector Extraction Implementation Details
We weight the aspect classification loss during backpropagation with weights proportional to the inverse frequency of each tag. We decay the learning rate by a factor of 0.9 when the loss value stops improving.

A.1.2 Linguistic Aspect Integrated Machine Translation Implementation Details
We implement our baseline Transformer model following the guidelines of Rush (2018) in our translation toolkit SFUTranslate, and extend it to implement the aspect-augmented model as well as the syntax-infused Transformer and the Transformer with the bert-freeze input setting. Table 8 provides the configuration settings for each of the models used in our experiments. We use the pre-trained WordPiece 25 (Schuster and Nakajima, 2012) tokenizer packaged and shipped with BERT (containing 31,102 sub-word tokens for German) to tokenize the source side data, and tokenize the target side data with MosesTokenizer 26 followed by a WordPiece tokenizer model trained on the target data, which splits the target tokens into sub-tokens. We set the target-side WordPiece vocabulary size to 30,000 sub-words for English and French. Our models share the vocabulary and embedding modules of source and target (Press and Wolf, 2017) since both sides are trained in sub-word space. The shared vocabulary sizes of M30k (German to English), M30k (German to French), IWSLT, and WMT are 16645, 16074, 40807, and 47940, respectively. We generate target sentences using beam search with beam size 4 and a length normalization factor (Wu et al., 2016) of 0.6. We merge the WordPiece tokens in the generated sentences (a post-processing step to create words) and use MosesDetokenizer to detokenize the generated outputs. We use the Moses recaser to produce cased translation outputs, and the mteval-v14.pl script for cased BLEU evaluation.

25 https://github.com/huggingface/tokenizers
26 https://github.com/alvations/sacremoses
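A sketch of this tokenization pipeline; the source-side BERT tokenizer name is the publicly available dbmdz model and may differ from the exact one used, and the target-side WordPiece model is assumed to be provided as tgt_wordpiece:

```python
from sacremoses import MosesTokenizer, MosesDetokenizer
from transformers import BertTokenizerFast

src_wordpiece = BertTokenizerFast.from_pretrained("dbmdz/bert-base-german-uncased")
moses_tok = MosesTokenizer(lang="en")
moses_detok = MosesDetokenizer(lang="en")

def tokenize_source(line):
    # German source side is tokenized directly with the BERT WordPiece tokenizer
    return src_wordpiece.tokenize(line)

def tokenize_target(line, tgt_wordpiece):
    # Moses tokenization followed by a target-trained WordPiece model
    return tgt_wordpiece.tokenize(" ".join(moses_tok.tokenize(line)))

def detokenize_output(subword_tokens):
    # merge WordPiece pieces back into words, then Moses-detokenize
    words = []
    for tok in subword_tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return moses_detok.detokenize(words)
```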
For all models, we set the positional encoding maximum length to 4096, dropout to 0.1, and label smoothing to 0.1, and initialize the models using Xavier initialization (Glorot and Bengio, 2010). We train all models with the NoamOpt optimizer (Rush, 2018) and use the gradient accumulation trick (Ott et al., 2018) with one update per a fixed number of batches (Table 8; grad accumulation) to simulate larger batch sizes on a single GPU.
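For reference, a sketch of the Noam learning-rate schedule and the gradient-accumulation trick mentioned above; the warmup and factor values are generic defaults, and compute_loss is a placeholder, not part of the toolkit:

```python
def noam_lr(step, d_model, warmup=4000, factor=1.0):
    """Noam learning-rate schedule used by NoamOpt (Rush, 2018)."""
    step = max(step, 1)
    return factor * (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)

def train_with_grad_accumulation(model, optimizer, batches, compute_loss, grad_accumulation):
    """Accumulate gradients over grad_accumulation batches before each optimizer
    step, simulating a larger batch size on a single GPU (Ott et al., 2018)."""
    optimizer.zero_grad()
    for i, batch in enumerate(batches, start=1):
        loss = compute_loss(model, batch) / grad_accumulation   # average over the simulated batch
        loss.backward()
        if i % grad_accumulation == 0:
            optimizer.step()
            optimizer.zero_grad()
```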

A.2 Additional Analysis of Linguistic Aspect Integrated Machine Translation Results
In this section, we analyze the results of our aspect-integrated translation experiments. We provide our analysis in two parts: one for the small and medium-sized datasets and one for the large dataset. For smaller datasets (a few hundred thousand sentence pairs or fewer), the broader perspective of BERT's knowledge helps limit the search space of the model. Using our technique, the translation model receives more information about the general use cases of (locally) rare words. Linguistic aspect vectors also help the model better handle syntactic structures in input sentences that are less familiar compared to what is frequent in its limited training data. This is why we believe aspect vectors can be helpful in low-resource settings.
Improving models trained on large amounts of data (several million sentence pairs) is challenging. The common practice in training neural translation models is to initialize the embedding module with small random values and let the model search the parameter space for optimal settings. Extracted aspect vectors, as an external source of monolingual knowledge on the source side, are a more reasonable starting point for large models than random initialization. Integrating aspect vectors thus helps these models find a better path towards the optimum and increases the chance of the model ending up at a more desirable point in the search space.