MA-BERT: Learning Representation by Incorporating Multi-Attribute Knowledge in Transformers

Incorporating attribute information such as user and product features into deep neural networks has been shown to be useful in sentiment analysis. Previous works typically accomplished this in two ways: concatenating multiple attributes to word/text representation or treating them as a bias to adjust attention distribution. To leverage the advantages of both methods, this paper proposes a multi-attribute BERT (MA-BERT) to incorporate external attribute knowledge. The proposed method has two advantages. First, it applies multi-attribute transformer (MA-Transformer) encoders to incorporate multiple attributes into both input representation and attention distribution. Second, the MA-Transformer is implemented as a universal layer and stacked on a BERT-based model such that it can be initialized from a pre-trained checkpoint and fine-tuned for the downstream applications without extra pre-training costs. Experiments on three benchmark datasets show that the proposed method outperformed pre-trained BERT models and other methods incorporating external attribute knowledge.


Introduction
To learn a distributed text representation for sentiment classification (Pang and Lee, 2008; Liu, 2012), conventional deep neural networks, such as convolutional neural networks (CNN) (Kim, 2014) and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), together with common integration techniques, such as self-attention mechanisms (Vaswani et al., 2017; Chaudhari et al., 2019) and dynamic routing algorithms (Gong et al., 2018; Sabour et al., 2017), are usually applied to compose the vectors of constituent words. To further enhance performance, pre-trained models (PTMs), such as BERT (Devlin et al., 2019), ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2019), and XLM-RoBERTa (Conneau et al., 2019), can be fine-tuned and transferred to sentiment analysis tasks. In practice, PTMs are first trained on a large amount of unannotated data using objectives such as masked language modeling or next-sentence prediction to learn how words are used and how the language is written in general. The models are then transferred to a downstream task and fine-tuned on a smaller task-specific dataset.
The abovementioned methods only use features from plain texts. Incorporating attribute information such as users and products can further improve sentiment analysis performance. Previous works typically incorporated such external knowledge by concatenating these attributes into word or text representations (Tang et al., 2015), as shown in Figs. 1(a) and (b). Such methods are often introduced in shallow models to attach attribute information and modify the representation of either words or texts. However, this may lack interaction between the attributes and the text, since it aligns all words equally to the attribute features, so the model is unable to emphasize important tokens. Several works have instead used attribute features as a bias term in self-attention mechanisms to model meaningful relations between words and attributes (Wu et al., 2018; Chen et al., 2016b; Dong et al., 2017; Dou, 2017), as shown in Fig. 1(c). Because the softmax function is used for normalization when calculating the attention scores, the incorporated attribute features only affect the allocation of the attention weights. As a result, the representations of the input words are not updated, and the information carried by these attributes is lost. For example, depending on individual preferences for chili, readers may focus on reviews talking about spicy food, but only those who like chili would consider such reviews useful recommendations. However, current self-attention models that learn text representations by adjusting the weights of spicy may still produce the same word representation of spicy for different persons, making it difficult to distinguish people who like chili from those who do not.
To address the above problems, this study proposes a multi-attribute BERT (MA-BERT) model which applies multi-attribute transformer (MA-Transformer) encoders to incorporate external attribute knowledge. Different from being incorporated into the attention mechanism as bias terms, multiple attributes can be injected into both the attention maps and the input token representations using bilinear interaction, as shown in Fig. 1(d). In addition, the MA-Transformer is implemented as a universal layer and stacked on a BERT-based model such that it can be initialized from a pre-trained checkpoint and fine-tuned for downstream tasks without extra pre-training costs. Experiments are conducted on three benchmark datasets (IMDB, Yelp-2013, and Yelp-2014) for sentiment polarity classification. The results show that the proposed MA-BERT model outperformed pre-trained BERT models and other methods incorporating external attribute knowledge.
The remainder of this paper is organized as follows. Section 2 provides a detailed description of the proposed methods. The empirical experiments are reported with analysis in Section 3. Conclusions are finally drawn in Section 4.

Figure 2: Overall architecture of the MA-BERT model.

BERT Encoder
By applying a word-piece tokenizer (Wu et al., 2016), the input text can be denoted as a sequence of tokens, i.e., s = {w_0, w_1, w_2, ..., w_{L-1}}, where L is the length of the text and w_0 = [CLS] is a special classification token. Moreover, its corresponding attributes are denoted as A = {a_0, a_1, ..., a_{M-1}}, where M is the number of attributes in the text. Thus, the i-th input sample can be denoted as a tuple, i.e., (A_i, s_i).
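As a concrete (and deliberately simplified) illustration, the following Python sketch packages one review into such a tuple. The whitespace tokenizer and the user/product vocabularies are hypothetical stand-ins for a real WordPiece tokenizer and learned attribute tables:

```python
# Sketch: packaging one input sample as a tuple (A_i, s_i).
# The tokenizer and vocabularies below are simplified placeholders.

def make_sample(text, user, product, user_vocab, product_vocab):
    s = ["[CLS]"] + text.lower().split()            # w_0 = [CLS], then tokens
    A = (user_vocab[user], product_vocab[product])  # M = 2 attribute ids
    return A, s

user_vocab, product_vocab = {"u42": 0}, {"p7": 0}
A_i, s_i = make_sample("The soup was spicy", "u42", "p7",
                       user_vocab, product_vocab)
# s_i -> ["[CLS]", "the", "soup", "was", "spicy"], A_i -> (0, 0)
```

In the actual model each attribute id would index a learned embedding rather than remain a raw integer.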
To learn the hidden representations, the pre-trained language model BERT (Devlin et al., 2019), which has achieved impressive performance on various natural language processing (NLP) tasks, was used. The token sequence is fed into the BERT model to obtain the representation, denoted as T = BERT(s; θ_BERT), where T ∈ R^{L×d_t} is the output representation of all tokens; θ_BERT denotes the trainable parameters of BERT, which are initialized from a pre-trained checkpoint and then fine-tuned during model training; and d_t = 768 is the dimensionality of the output representation.
Following Wu et al. (2018) and Wang et al. (2017), each of the attributes is mapped to an attribute representation E_{A,m} ∈ R^{d_E}. The multi-attribute attention is then computed as

U_m = softmax(Q_m K_m^T / √d_E) V_m,
U = (U_1 ⊕ U_2 ⊕ ... ⊕ U_M) W_o,

where U_m is the attention from the m-th attribute; W_o ∈ R^{(M·d_E)×d_t} is the output linear projection; and d_E denotes the dimensionality of Q, K and V, the matrices that package the queries, keys and values, defined as

Q_m = (T · W_{q,m}) ∘ E_{A,m},
K_m = (T · W_{k,m}) ∘ E_{A,m},
V_m = (T · W_{v,m}) ∘ E_{A,m},

where Q_m, K_m and V_m ∈ R^{L×d_E} are bilinear transformations (Huang et al., 2019) applied to the input representation T and the attribute representation E_{A,m}; W_{q,m}, W_{k,m} and W_{v,m} ∈ R^{d_t×d_E} are weight matrices for the query, key and value projections; and · and ∘ respectively denote the inner and the Hadamard product. Similar to Vaswani et al. (2017), a multi-head mechanism is also introduced for the MA-Transformer:

U_m = head_1 ⊕ head_2 ⊕ ... ⊕ head_K,

where K is the number of heads for each attribute and ⊕ denotes the concatenation operator; E_{A,m}^k ∈ R^{d_E} is the m-th attribute representation in the k-th head, and its dimensionality must satisfy d_E = d_t/K. Given that different heads can capture different relation types along with text representations, different parameters are used for different heads.
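Under this reading of the attribute-injected attention, its single-head form can be sketched in NumPy as follows; all dimensions, weights, and names are illustrative toy values, not the trained parameters of the model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ma_attention(T, E, Wq, Wk, Wv, Wo):
    """Single-head multi-attribute attention over M attributes.

    T: (L, d_t) token representations; E[m]: (d_E,) attribute vector;
    Wq/Wk/Wv[m]: (d_t, d_E) projections; Wo: (M*d_E, d_t) output projection.
    """
    d_E = Wq[0].shape[1]
    outs = []
    for m, e in enumerate(E):
        # Bilinear injection: linear projection of the tokens, then a
        # Hadamard product with the attribute vector (broadcast over rows).
        Q = (T @ Wq[m]) * e
        K = (T @ Wk[m]) * e
        V = (T @ Wv[m]) * e
        A = softmax(Q @ K.T / np.sqrt(d_E))    # (L, L) attention map
        outs.append(A @ V)                     # (L, d_E) per-attribute output
    return np.concatenate(outs, axis=-1) @ Wo  # back to (L, d_t)

rng = np.random.default_rng(0)
L, d_t, d_E, M = 4, 8, 8, 2                    # toy sizes
T = rng.standard_normal((L, d_t))
E = [rng.standard_normal(d_E) for _ in range(M)]
Wq, Wk, Wv = ([rng.standard_normal((d_t, d_E)) for _ in range(M)]
              for _ in range(3))
Wo = rng.standard_normal((M * d_E, d_t))
U = ma_attention(T, E, Wq, Wk, Wv, Wo)         # ready for residual + norm
```

Because the attribute vector multiplies V as well as Q and K, the attributes reshape the token representations themselves, not only the attention weights.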

MA-Transformer
Taking the representations of both the text T and the attributes A as input, an MA-Transformer encoder processes them in the same way as a standard transformer encoder (Vaswani et al., 2017) to generate Y ∈ R^{L×d_t}. Y is then connected to the input representation T through a residual connection followed by a normalization layer. The intermediate output is passed to a two-layer feed-forward network with a rectified linear unit (ReLU) activation function. Similarly, residual and normalization layers are connected to generate the final output, which is taken as the input for the next encoder.
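The residual, normalization, and feed-forward steps above can be sketched as follows, again in NumPy with toy shapes; here Y stands for the attention output of the encoder and the weight matrices are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block_tail(T, Y, W1, b1, W2, b2):
    """Post-attention part of one MA-Transformer encoder.

    T: (L, d_t) input representation; Y: (L, d_t) attention output.
    """
    H = layer_norm(T + Y)                          # residual + norm
    F = np.maximum(0.0, H @ W1 + b1) @ W2 + b2     # two-layer FFN with ReLU
    return layer_norm(H + F)                       # residual + norm

rng = np.random.default_rng(1)
L, d_t, d_ff = 4, 8, 16                            # toy sizes
T, Y = rng.standard_normal((2, L, d_t))
W1, b1 = rng.standard_normal((d_t, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_t)), np.zeros(d_t)
out = encoder_block_tail(T, Y, W1, b1, W2, b2)     # input to the next encoder
```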
By stacking several MA-Transformer encoders on the BERT model, the MA-BERT model generates a review representation h_[CLS] corresponding to the special token [CLS]. Then, a classifier comprising a linear projection and a softmax activation (with the output dimension equal to the number of classes) is used for classification.
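The classification head is a standard linear-plus-softmax layer; a minimal sketch with illustrative sizes (the class count here is only an example):

```python
import numpy as np

def classify(h_cls, W, b):
    """Map the [CLS] representation to class probabilities.

    h_cls: (d_t,); W: (d_t, C); b: (C,), with C the number of classes.
    """
    z = h_cls @ W + b
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(2)
d_t, C = 8, 10                       # toy sizes; C = number of rating classes
probs = classify(rng.standard_normal(d_t),
                 rng.standard_normal((d_t, C)), np.zeros(C))
```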

Comparative Experiments
Datasets. Following the experimental settings used in Tang et al. (2015), the proposed MA-BERT model is evaluated using three benchmark datasets (IMDB, Yelp-2013, and Yelp-2014). The evaluation metrics include accuracy (Acc.) and root mean squared error (RMSE). Higher Acc. and lower RMSE scores indicate better performance.

Implementation Details. The baseline methods can be divided into three groups. The first group includes methods without user and product information, such as CNN (Kim, 2014), BiLSTM (Hochreiter and Schmidhuber, 1997), neural sentiment classification (NSC) (Chen et al., 2016a), and its variant with a local attention mechanism (NSC+LA). For the BERT-based methods, the uncased-base-BERT model consisting of 12 layers of transformer encoders was implemented for comparison. ToBERT (Pappagari et al., 2019) was trained non-end-to-end in a two-stage manner using a word-to-segment strategy.
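For reference, the two evaluation metrics can be computed with a short stdlib-only sketch on toy labels:

```python
import math

def accuracy(y_true, y_pred):
    # Fraction of exactly matched rating labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root mean squared error over ordinal rating labels.
    return math.sqrt(sum((t - p) ** 2
                         for t, p in zip(y_true, y_pred)) / len(y_true))

gold, pred = [1, 2, 3, 4], [1, 2, 3, 2]
acc, err = accuracy(gold, pred), rmse(gold, pred)   # 0.75 and 1.0
```

RMSE complements accuracy here because the labels are ordinal: predicting 2 for a true 4 is penalized more than predicting 3.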
The second group includes existing methods incorporating user and product information such as NSC with user (U) and product (P) information incorporated into an attention (A) mechanism (NSC+UPA), user product neural network (UPNN) (Tang et al., 2015), hierarchical model with separate user attention and product attention (HUAPA) (Wu et al., 2018), and the chunkwise importance matrix model (CHIM) (Amplayo, 2019).
The third group includes a set of BERT-based methods incorporating user and product information using the different strategies presented in Figs. 1(a)-(c). In detail, an uncased-base-BERT model first extracted fixed feature vectors from the texts. Then, the BERT Concat (word) model incorporates attribute features into each word vector and stacks another 6 transformer encoders as the feature extractor. Similarly, BERT Concat (text) incorporates attribute features into the representation of the special token [CLS] for classification. Finally, BERT Attention (bias) applied 6 more MA-Transformers that inject attributes only into Q and K to calculate the attention scores, instead of also into V as in Eq. (6).
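The two concatenation baselines differ only in where the attribute vector is attached, which a small NumPy sketch makes explicit (toy shapes; the attribute vector is a placeholder for the concatenated user and product features):

```python
import numpy as np

L, d_t, d_a = 4, 8, 6        # toy sizes; d_a = attribute feature dimension
T = np.zeros((L, d_t))       # fixed token features from a frozen BERT
a = np.ones(d_a)             # user + product attribute vector (placeholder)

# Concat (word): attach the attributes to every token vector.
T_word = np.concatenate([T, np.tile(a, (L, 1))], axis=-1)   # (L, d_t + d_a)

# Concat (text): attach the attributes only to the [CLS] vector.
t_text = np.concatenate([T[0], a])                          # (d_t + d_a,)
```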
The proposed MA-BERT models applied 6 MA-Transformer encoders to incorporate the user and product attributes, stacked over the BERT model. The AdamW (Loshchilov and Hutter, 2017) optimizer was used with a base learning rate of 2e-5 and a warmup linear schedule. An early stopping (Prechelt, 1998) strategy with a patience of 3 epochs was also applied to avoid overfitting. The code for this paper is available at: https://github.com/yoyo-yun/MA-Bert.
Comparative Results and Discussion. Table 1 shows the comparative results of the different methods for sentiment ordinal classification. Among the models without user and product attributes, BiLSTM outperforms CNN (UPNN w/o UP) due to its stronger ability to encode sequential text. Furthermore, both NSC and NSC+LA outperformed BiLSTM, mainly because of their hierarchical structure. Incorporating the user and product attributes improved performance. For example, UPNN achieved a better result than its variant without user and product attributes, i.e., CNN (UPNN w/o UP). In addition, both NSC+UPA and HUAPA introduced the user and product information as a bias to guide the hierarchical attention, and thus outperformed NSC and NSC+LA.
The proposed MA-BERT achieved the best performance on all datasets. Compared with the baselines without user and product attributes, MA-BERT can leverage the attribute features to boost performance. MA-BERT also outperformed the methods that already incorporate user and product attributes (i.e., NSC+UPA, HUAPA, and CHIM (embedding)) because the proposed model injects attribute knowledge into both the attention map and the input representation.
The BERT and ToBERT models improved on all datasets over the conventional models, owing to the knowledge transferred from pre-training. However, lacking extra attribute features, their performance remained lower than that of the proposed MA-BERT model. MA-BERT also outperformed BERT Concat (word), BERT Concat (text), and BERT Attention (bias), indicating that the proposed MA-Transformer architecture improves on existing incorporation strategies. Furthermore, the proposed MA-BERT could be initialized from a pre-trained BERT checkpoint, making full use of the pre-trained parameters without additional pre-training costs.

Conclusion
This paper proposes a multi-attribute BERT (MA-BERT) model capable of incorporating multiple attributes into BERT-based PTMs for learning attribute-specific text representations. Different from existing attention models, the MA-Transformer can inject external knowledge into both the attention maps and the input representation. Additionally, the proposed model could be extended to other tasks by using the MA-Transformer encoder as a universal layer and stacking it on a BERT-based model. Future work will attempt to incorporate such multiple attributes into PTMs during the pre-training phase.