Fusing Label Embedding into BERT: An Efficient Improvement for Text Classification

With pre-trained models, such as BERT, gaining more and more attention, plenty of research has been done to further promote their capabilities, from enhancing the experimental procedures (Sun et al., 2019) to improving the mathematical principles. In this paper, we propose a concise method for improving BERT's performance in text classification by utilizing a label embedding technique while keeping almost the same computational cost. Experimental results on six text classification benchmark datasets demonstrate its effectiveness.


Introduction
Text classification is a classic problem in natural language processing (NLP). The task is to assign one or more predefined classes to a given text, with text representation serving as an important intermediate step.
Pre-trained models have also been greatly beneficial in text classification, in that they streamline the training process by avoiding training from scratch (Stein et al., 2019). One group of approaches has focused on word embeddings, such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014); another has focused on contextualized word embeddings, from CoVe (McCann et al., 2017) to ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018), ULMFiT (Howard and Ruder, 2018), and BERT (Devlin et al., 2019).
BERT has achieved particularly impressive performance across a variety of NLP tasks. Following its success, models pre-trained on large amounts of data, such as ERNIE (Zhang et al., 2019), RoBERTa (Liu et al., 2019), UniLM (Dong et al., 2019), and XLNet, have become popular thanks to their ability to learn contextualized representations. These models are based on the multi-layered bidirectional attention mechanism (Vaswani et al., 2017) and are trained through the masked word prediction task, which are two of the main components of BERT. Continuing to investigate the potential of BERT remains important, since the findings can help with the investigation of its variants as well.
In this work, we propose a simple but effective method to improve BERT's performance in text classification. We enhance the contextual representation learning through encoding the texts of class labels (e.g. "world", "sports", "business", and "science technology" in the AGNews dataset) along with the documents, without changing the original encoder structure. Our main contributions are as follows.
• The embeddings of both texts and labels are jointly learned from the same latent space, and so no further intermediate steps are needed.
• Our implementation takes more thorough and efficient advantage of BERT's inherent self-attention for the interaction between the label embeddings and text embeddings, without introducing other mechanisms.
• Since only the original structure of BERT is required, our method barely increases the amount of computation.
• Extensive results on six benchmark datasets reveal that our method taps into the deeper potential of BERT, leading to optimism that BERT can be further improved for text classification as well as other downstream tasks.

Related Work
Apart from the pre-trained models for learning general language representations mentioned above, some studies have focused specifically on leveraging the representations of classes or higher-level global information. Examples include tBERT (Peinelt et al., 2020), which combines topic models with BERT for pairwise semantic similarity detection, and LCM (Guo et al., 2020), which generates an enhancement distribution over the one-hot vector representing the classes by calculating the similarity between instances and labels to improve classification performance. Moreover, label embedding has increasingly taken a leading role in related research. It is a technique in which the contents of labels are also embedded, so that the model can be trained to handle the label information and input features at the same time. It has proven effective in various domains including image classification (Akata et al., 2015), multi-modal learning between images and texts (Frome et al., 2013; Kiros et al., 2014), text recognition in images (Rodriguez-Serrano et al., 2013), and zero-shot learning (Palatucci et al., 2009; Yogatama et al., 2015; Li et al., 2015; Ma et al., 2016).
Notably, in the field of text classification, prior work converted the task into a vector-matching problem, while Yang et al. (2018) utilized a sequence generation framework for capturing the correlation between labels. Wang et al. (2018a) proposed the label embedding attentive model (LEAM), an attention-based framework that jointly learns the embeddings of words and labels from a shared latent space. Inspired by LEAM, Si et al. (2020) developed LESA-BERT, where label embeddings are incorporated into self-attention by modifying attention scores. Our approach differs from these in that it considers bidirectional attention between the label and document embeddings in BERT without changing its attention process.

Proposed Method
To fuse label embeddings into BERT, we concatenate the texts of the labels and the original document to be classified with a [SEP] token as the input, and use different segment embeddings for the label texts and the document text. The actual label texts are listed in Appendix A.
We denote the document tokens as D_i and their corresponding token embeddings as E_{D_i}. Hence, D_K refers to the last token of the input document, where K is the number of words in the document. Let L_j be the label text of the j-th class out of the total C classes. Since L_j may consist of several subwords, we calculate E_{L_j}, the embedding of L_j, by averaging the token embeddings of all subwords in L_j. In this way, the length of the label sentence is equal to C, and E_{L_j} can be encoded together with E_{D_i} through self-attention. We denote this method as w/ [SEP].
Then, following the same process as the original BERT, we apply a linear layer with the Tanh activation function to the last-layer hidden state at the [CLS] token, T_[CLS], to form the input of the softmax layer. We use cross-entropy loss for training.
In addition to the paired input, we examine another setting that concatenates the label texts and the document text without utilizing [SEP] or discriminating their segment embeddings. The procedure for computing the token embeddings stays the same as in the paired input setting. We denote this method as w/o [SEP].
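As a concrete illustration of both input settings, the following is a minimal sketch assuming the Hugging Face transformers and PyTorch APIs. It is our own reconstruction, not the authors' released code; details such as the exact placement of the [CLS] and [SEP] tokens and the helper name build_inputs are our assumptions. Because the averaged label embeddings are passed to the encoder through inputs_embeds, BERT still adds its own position and segment embeddings internally, and the standard classification head over [CLS] with cross-entropy loss is used unchanged.

```python
# Minimal sketch (not the authors' code) of the w/ [SEP] and w/o [SEP] inputs.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)             # e.g., AGNews has 4 classes
word_emb = model.bert.embeddings.word_embeddings   # token embedding table

label_texts = ["world", "sports", "business", "science technology"]

def build_inputs(document, use_sep=True, max_len=128):
    """Return inputs_embeds, token_type_ids, attention_mask for one document."""
    # One averaged embedding per class, so the label "sentence" has length C.
    label_vecs = []
    for text in label_texts:
        ids = tokenizer(text, add_special_tokens=False,
                        return_tensors="pt").input_ids[0]
        label_vecs.append(word_emb(ids).mean(dim=0))            # E_{L_j}
    label_embeds = torch.stack(label_vecs)                      # (C, hidden)

    doc_ids = tokenizer(document, add_special_tokens=False,
                        truncation=True, max_length=max_len).input_ids
    doc_embeds = word_emb(torch.tensor(doc_ids))                # (K, hidden)

    cls = word_emb(torch.tensor([tokenizer.cls_token_id]))
    sep = word_emb(torch.tensor([tokenizer.sep_token_id]))

    if use_sep:   # w/ [SEP]: [CLS] labels [SEP] document [SEP], two segments
        embeds = torch.cat([cls, label_embeds, sep, doc_embeds, sep])
        seg = [0] * (1 + len(label_texts) + 1) + [1] * (len(doc_ids) + 1)
    else:         # w/o [SEP]: label sequence simply prefixed to the document
        embeds = torch.cat([cls, label_embeds, doc_embeds, sep])
        seg = [0] * embeds.size(0)

    return (embeds.unsqueeze(0),
            torch.tensor([seg]),
            torch.ones(1, embeds.size(0), dtype=torch.long))

# Usage: standard fine-tuning with cross-entropy loss on the [CLS] output.
inputs_embeds, token_type_ids, attention_mask = build_inputs("Stocks rallied ...")
out = model(inputs_embeds=inputs_embeds, token_type_ids=token_type_ids,
            attention_mask=attention_mask, labels=torch.tensor([2]))
loss = out.loss
```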

Further Enhancement Using tf-idf
In addition to encoding the original texts of labels into BERT with the document, we experiment with selecting more words as representatives for each class, which expands the number of tokens in L_j. We investigate whether this enhancement can further improve the performance of our models. After tokenizing all the documents under one class in the training set with the BERT tokenizer based on WordPiece (Wu et al., 2016), we calculate the average tf-idf score of each subword and add the top 5, 10, 15, or 20 subwords as supplemental label texts to the corresponding class.
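The following is a minimal sketch of one way to implement this selection; it reflects our reading of the procedure rather than the authors' code. We assume tf-idf is computed over the whole training corpus with the WordPiece tokenizer and then averaged within each class; the function name and the exact tf-idf weighting are assumptions.

```python
# Sketch: select the top-k subwords per class by averaged tf-idf score.
import numpy as np
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def top_subwords_per_class(texts, labels, k=5):
    """Return {class: [k subwords with the highest average tf-idf]}."""
    vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenize, lowercase=False)
    tfidf = vectorizer.fit_transform(texts)            # (n_docs, vocab_size)
    vocab = np.array(vectorizer.get_feature_names_out())

    rows_by_class = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)

    top = {}
    for label, rows in rows_by_class.items():
        mean_scores = np.asarray(tfidf[rows].mean(axis=0)).ravel()
        top[label] = vocab[np.argsort(mean_scores)[::-1][:k]].tolist()
    return top

# The selected subwords are appended to each class's label text:
# supplemental = top_subwords_per_class(train_texts, train_labels, k=5)
```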

Datasets
To evaluate the effectiveness of our method, we performed experiments on six benchmark datasets. As the original benchmarks do not include a development set, we randomly created one from the training set (after removing duplicate samples) for each dataset, in accordance with the class distribution of the original test set.
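A minimal sketch of this split is shown below; it assumes pandas DataFrames with "text" and "label" columns and a development-set size chosen by us (the actual sizes are given in Table 1), so it illustrates the procedure rather than reproducing the exact splits.

```python
# Sketch: carve a development set out of the training set, after removing
# duplicates, so that it follows the class distribution of the original test set.
import pandas as pd

def make_dev_split(train_df, test_df, dev_size=5000, seed=42):
    train_df = train_df.drop_duplicates(subset="text").reset_index(drop=True)
    test_dist = test_df["label"].value_counts(normalize=True)

    dev_parts = []
    for label, frac in test_dist.items():
        n = int(round(frac * dev_size))
        pool = train_df[train_df["label"] == label]
        dev_parts.append(pool.sample(n=n, random_state=seed))

    dev_df = pd.concat(dev_parts)
    new_train_df = train_df.drop(dev_df.index)
    return new_train_df, dev_df
```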
We introduce the original size of each dataset below; see Table 1 for detailed statistics of our training, development, and test sets. Except for IMDb, all the datasets we used were originally constructed by Zhang et al. (2015).
• AGNews A news article dataset with titles and descriptions, containing 120,000 training samples and 7,600 for testing. Four classes are included: World, Sports, Business, and Science & Technology.
• DBPedia An ontology classification dataset over 14 classes, containing 560,000 samples for training and 70,000 for testing.
• Yahoo! Answers Topic A dataset containing 1,400,000 training samples and 60,000 testing samples with ten categories. Each sample includes the question title, question content, and best answer.
• IMDb (Maas et al., 2011) A binary sentiment classification dataset containing 25,000 highly polar movie reviews for training and 25,000 for testing. Since its training and test sets are originally of the same size, we merged them and randomly split the result into approximately 8:1:1 for training, development, and testing.
• Yelp Review Full A dataset extracted from Yelp Dataset Challenge 2015 data by randomly taking 130,000 training samples and 10,000 testing samples for each starred review from 1 to 5. In total, there are 650,000 training samples and 50,000 testing samples.
• Yelp Review Polarity A dataset also extracted from Yelp Dataset Challenge 2015 data but coarsely divided into two classes, considering 1 and 2 stars as negative, and 4 and 5 as positive. In total, there are 560,000 training samples and 38,000 testing samples.

Settings
For both the baselines (BERT and LESA-BERT) and our proposed methods, we used the pre-trained uncased BERT-base model (Wolf et al., 2019), which consists of 12 Transformer blocks (Vaswani et al., 2017) with 12 self-attention heads and a hidden size of 768. We set the learning rate to 2e-5 and the batch size to 24. The dropout probability was kept at 0.1. For optimization, we used AdamW (Loshchilov and Hutter, 2018) with an epsilon of 1e-8. The models were trained for five epochs on each benchmark. At the end of each epoch, they were evaluated on the development set, and the checkpoints with the highest accuracy were saved. We report those models' performance on the test set. Training was done on a 2080Ti for AGNews and DBPedia and on a Titan RTX for the rest. See Table 1 for the maximum sentence length and warm-up steps we assigned for each dataset. We decided the maximum length based on the average length statistics of each dataset, so as to fully utilize the GPU memory.
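The fine-tuning setup can be sketched as follows, reusing `model` from the earlier input-construction sketch and assuming `train_loader`, `dev_loader`, and an `evaluate` helper exist as in a standard fine-tuning script. The concrete warm-up value is a placeholder; the paper's per-dataset values are in Table 1.

```python
# Sketch: AdamW (lr=2e-5, eps=1e-8) with linear warm-up, five epochs,
# selecting the checkpoint with the highest development-set accuracy.
import torch
from transformers import get_linear_schedule_with_warmup

EPOCHS, WARMUP_STEPS = 5, 1000   # warm-up steps are dataset-specific (Table 1)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8)
num_training_steps = EPOCHS * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS,
    num_training_steps=num_training_steps)

best_acc = 0.0
for epoch in range(EPOCHS):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(**batch).loss        # cross-entropy from the classifier head
        loss.backward()
        optimizer.step()
        scheduler.step()
    acc = evaluate(model, dev_loader)     # accuracy on the development set
    if acc > best_acc:                    # keep the best checkpoint
        best_acc = acc
        torch.save(model.state_dict(), "best_model.pt")
```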
Note that for the Yelp Review Full dataset we used the adjectives "bad, poor, fair, good, excellent", representing the number of stars, instead of the numbers 1 to 5 as the basic label texts, since numbers appear in various unrelated contexts, which may lead to ambiguity.
We chose the number of top-ranked subwords added for each method on each dataset using the development set. For example, Table 3 shows the averaged results on the AGNews development set for the three methods with the top 5, 10, 15, and 20 words added. LESA-BERT (Si et al., 2020), w/ [SEP], and w/o [SEP] all reached the highest accuracy when five words were added, so this was their final configuration at test time. The comparative experiments were also conducted on the other five datasets (see Appendix B for details).

Experimental Results
Table 2 summarizes the main results (see the appendix for detailed results). We find that fusing only the original label texts, either with or without [SEP], yielded an improvement over the baselines, except on Yahoo. We assume this is because the original labels are not discriminative enough for big datasets, and so they may corrupt the input rather than enhance it, which leads to the degradation in accuracy.
However, when the top-ranked words were added, the performance on Yahoo was boosted to exceed the baselines. We observe that this improvement from adding supplemental words occurred on most benchmarks. Note that the added words can sometimes also improve the performance of the baseline, LESA-BERT.
On the other hand, the performance of all methods dropped drastically on Yelp F. We assume this is because the top-ranked subwords by averaged tf-idf score may not be good representatives of the granularity and polarity of emotions, even though they can be powerful enough for distinguishing between topics. The enhancement helped IMDb and Yelp P. but not Yelp F., though all are benchmarks for sentiment analysis. In contrast to IMDb and Yelp P., which have only positive and negative labels, Yelp F. has finer-grained labels whose distinction depends on context, and so the effect of the tf-idf-based enhancement might be restricted on Yelp F. because the tf-idf score represents only the importance of the words.
Note that w/o [SEP] is better than w/ [SEP] in most cases. The Next Sentence Prediction (NSP) task, used in BERT to learn sentence-level representations, concatenates two natural language sentences with a [SEP] token. On the other hand, when we concatenate a label sequence with an input document, the [SEP] token joins a non-natural-language sequence with a natural language sentence. This difference may have caused a mismatch between pre-training and fine-tuning in BERT, leading to the performance degradation. Thus, simply adding the label sequence as a prefix, as in the w/o [SEP] method, which still provides the information gain, could yield a more stable improvement.

Next, we used t-SNE (Maaten and Hinton, 2008) to visualize the learned representations on a two-dimensional map, as shown in Figure 2. We visualize the vectors learned by the w/o [SEP] model on the Yelp F. test set. Each color represents a different class. The point clouds are T_[CLS] vectors, and each point corresponds to a test sample. The large dots with black circles are the averaged vectors of T_{L_j}, the encoded embedding of each label. Compared with the embedding of [CLS], the label embeddings are more separated in the vector space. This is presumably why the label embeddings can support classification.
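A minimal sketch of such a visualization is given below, assuming we have already collected the last-layer [CLS] vectors for the test samples (cls_vecs), their class ids (labels), and the per-class averaged label-token vectors (label_vecs); the names and plotting choices are our own.

```python
# Sketch: t-SNE projection of [CLS] vectors (small points) and averaged label
# embeddings (large dots with black edges), one color per class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(cls_vecs, labels, label_vecs, seed=0):
    all_vecs = np.vstack([cls_vecs, label_vecs])
    points = TSNE(n_components=2, random_state=seed).fit_transform(all_vecs)
    doc_pts, lab_pts = points[: len(cls_vecs)], points[len(cls_vecs):]

    plt.scatter(doc_pts[:, 0], doc_pts[:, 1], c=labels, s=3, cmap="tab10")
    plt.scatter(lab_pts[:, 0], lab_pts[:, 1], c=range(len(label_vecs)),
                s=200, cmap="tab10", edgecolors="black", linewidths=1.5)
    plt.savefig("tsne_yelp_f.png", dpi=300)
```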

Conclusion
We proposed a simple but effective method for fusing label embeddings into BERT while utilizing its inherent input structure and self-attention mechanism, which leads to significant improvements on benchmarks of relatively small and medium size. The experiments adding subwords with top-ranked average tf-idf scores as supplemental label texts demonstrated that our method can generally improve the performance as expected. As there may be more appropriate methods for constructing enhanced representations, we intend to explore this further in future work. We will also examine different ways of uncovering more of the potential of pre-trained attentional models like BERT.