FLiText: A Faster and Lighter Semi-Supervised Text Classification with Convolution Networks

In natural language processing (NLP), state-of-the-art (SOTA) semi-supervised learning (SSL) frameworks have shown great performance on deep pre-trained language models such as BERT, and are expected to significantly reduce the demand for manual labeling. However, our empirical studies indicate that these frameworks are not suitable for lightweight models such as TextCNN and LSTM. In this work, we develop a new SSL framework called FLiText, which stands for Faster and Lighter semi-supervised Text classification. FLiText introduces an inspirer network together with the consistency regularization framework, which imposes a generalized regular constraint on the lightweight model for efficient SSL. As a result, FLiText obtains new SOTA performance for lightweight models across multiple SSL text classification benchmarks. Compared with existing SOTA SSL methods on TextCNN, FLiText improves the accuracy of the lightweight model TextCNN from 51.00% to 90.49% on IMDb, from 39.80% to 58.06% on Yelp-5, and from 55.30% to 65.08% on Yahoo. In addition, compared with the fully supervised method on the full dataset, FLiText uses less than 1% of the labeled data to improve the accuracy by 6.59%, 3.94%, and 3.22% on IMDb, Yelp-5, and Yahoo respectively.


Introduction
Developments in deep learning have led to great breakthroughs in most natural language processing (NLP) tasks, such as machine translation, sentiment analysis, and reading comprehension (Devlin et al., 2019; Ji et al., 2020; Zhong et al., 2020; Tao et al., 2019; Zhang et al., 2019; Wang et al., 2019; Tian et al., 2020a,b). The success of these advancements is highly dependent on large-scale, high-quality manually labeled data. However, obtaining vast amounts of high-quality labeled data is expensive. In certain fields in particular, such as finance, medicine, and law, text labeling relies on the in-depth participation of domain experts. The rapid development of SSL technology is expected to significantly reduce the demand for labeled data. The core goal of this technology is to use a small amount of labeled data and vast amounts of unlabeled data to train a model with good generalization performance (Lee, 2013; Laine and Aila, 2016; Miyato et al., 2018; Tarvainen and Valpola, 2017; Berthelot et al., 2019b; Xie et al., 2020; Sohn et al., 2020; Berthelot et al., 2019a). Unsupervised data augmentation (UDA) (Xie et al., 2020) and MixText are SOTA SSL methods for text classification, and have been applied to various tasks with notable success. In NLP, applying the SSL framework to deep pre-trained language models (e.g., BERT, GPT, and XLNet) has been demonstrated to be effective. However, the good performance of these SSL methods depends on a bulky "large model". In most practical situations, due to their large-scale parameters and slow inference speed, it is difficult to deploy these models under limited resources, such as on mobile devices (Sun et al.), in online search engines (Lu et al., 2020), and on edge platforms (Tambe et al., 2020).
An intuitive idea to address the problem is to apply the SSL method to a small model, such as TextCNN, whose parameter scale is about one or two orders of magnitude lower than that of BERT, as shown in Figure 1. However, many applications show that existing SOTA SSL frameworks perform poorly on lightweight models, and there is a lack of relevant research on the implementation of SSL on lightweight models. This paper develops an SSL framework for lightweight models, for faster and lighter semi-supervised text classification (FLiText). We use a deep pre-trained inspirer network to learn the distribution of the data and the task-specific features. The inspirer network then provides two types of regularization constraints on a lightweight model. The intuitive explanation is as follows: "Teachers not only teach results but also teach experiences in the learning process so that students can learn more effectively." To evaluate FLiText, we compare it with SOTA methods on three benchmark text classification datasets. We also conduct an ablation study to verify the contribution of each part of FLiText. The results show that FLiText can significantly improve the inference speed while maintaining or exceeding SOTA performance. Compared with UDA on TextCNN, FLiText improves the accuracy from 51.00% to 90.32% on the IMDb dataset, from 39.80% to 58.06% on the Yelp-5 dataset, and from 55.30% to 65.08% on the Yahoo dataset. Compared with supervised learning on the complete datasets, the performance is improved by 6.28%, 4.08%, and 3.81% on the three datasets respectively, using less than 1% of the labeled data. Our contributions can be summarized as follows:
• To the best of our knowledge, FLiText is the first SSL framework in NLP proposed for lightweight models, and it achieves new SOTA SSL performance on multiple datasets.
• We experimentally demonstrate that FLi-Text using less than 1% labeled data outperforms the supervised method using complete datasets on a lightweight model.
• We propose a new semi-supervised distillation method for knowledge distilling from BERT to TextCNN, which outperforms output-based knowledge distillation (KD) significantly.
• We experimentally demonstrate introducing a consistent regularization framework in KD improves the performance of the student model. Our source code can be obtained from: https://github.com/ valuesimplex/FLiText.

Related Work
Semi-Supervised Learning: (Lee, 2013) uses pseudo labels on unlabeled data for supervised learning. (Rasmus et al., 2015) obtains the learning signal with an autoencoder. (Laine and Aila, 2016) constructs a consistency regularization by calculating the mean squared error between the current model's prediction and the average of its historical predictions. (Miyato et al., 2018) adopts adversarial learning to generate noise. (Berthelot et al., 2019b) averages the predictions of K types of data augmentation on unlabeled data to achieve consistency regularization. (Berthelot et al., 2019a) aligns the predicted distribution with the ground-truth distribution. UDA (Xie et al., 2020) applies consistency regularization to unlabeled data after back-translation and tf-idf-based augmentation. MixText proposes the "TMix" data augmentation. (Ren et al., 2020) adds a weight for each unlabeled sample. However, these SOTA methods all rely on deep pre-trained language models such as BERT, and so far no research on SSL for lightweight models has been reported.
Knowledge Distillation: (Hinton et al., 2015) uses a student model to mimic the teacher's predictions via soft targets. (Tang et al., 2019) distills BERT into a single-layer BiLSTM, distilling the knowledge of a Transformer-based model into a non-Transformer model for the first time. (Sun et al., 2019) extracts knowledge from the intermediate layers of BERT; (Sanh et al., 2019) distills knowledge during the model's pre-training stage; subsequent work combines the above methods and proposes a two-stage distillation method. Although all of these methods achieve excellent results, their Transformer-based students still suffer from large parameter counts and high computational complexity.

Framework
As shown in Figure 2, the biggest difference between FLiText and previous SSL models is the introduction of an inspirer network alongside the lightweight target network. The inspirer network uses consistency regularization and data augmentation to sufficiently mine information and features from the unlabeled data and the limited labeled data. It then provides regularized constraints at two levels (i.e., the output space and the hidden space) to lead the lightweight target network to efficient SSL using only a small amount of labeled data. The framework rests on two insights. First, (Ba and Caruana, 2014) observes that training a lightweight model on the output of a larger model can work better than training on the original data. Second, the general approximation theorem (Nguyen et al., 2016) indicates that the function space of a lightweight model can usually cover the target function required by the downstream task. Therefore, as a supplement to the standard optimizer, the inspirer network can provide a well-qualified regularized constraint for the training of the lightweight model. We define X = {(x_i, y_i)}, i ∈ {1, ..., n} as the labeled dataset and U = {u_j}, j ∈ {1, ..., m} as the unlabeled dataset, where n is the number of labeled samples and m is the number of unlabeled samples.

Inspirer Network
The inspirer network comprises three parts: a text encoder, a text classifier, and a feature projection. The text encoder is a pre-trained language model built by stacking multiple transformer blocks, such as BERT. Given an input sentence x_i, we obtain its feature vector h_i ∈ R^d, where d is the dimension of the hidden vector. We use h_i and a two-layer multi-layer perceptron (MLP) to construct the text classifier and fine-tune it on the downstream classification task. We denote the output of the MLP as z_i^(T). To align the dimensions of BERT and TextCNN, we feed the hidden states into the feature projection Ig(·), which is composed of a single MLP and a nonlinear activation function. Its output is denoted as If_i^l, where l ∈ L indexes the transformer layers.
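As a concrete illustration, the feature projection can be sketched as a single linear map followed by a nonlinearity. This is a minimal NumPy sketch, assuming a BERT hidden size of 768, a projection size of 256, and a ReLU nonlinearity (matching the implementation details given later); the random weights are illustrative, not a trained model:

```python
import numpy as np

def feature_projection(h, W, b):
    """Single-MLP feature projection with a ReLU nonlinearity, mapping a
    hidden state (e.g. a BERT layer output) into the shared alignment space."""
    return np.maximum(W @ h + b, 0.0)

# Illustrative dimensions: BERT hidden size 768 -> projection size 256.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(256, 768))
b = np.zeros(256)
h = rng.normal(size=768)           # one sentence-level hidden vector
If_l = feature_projection(h, W, b)
```

In FLiText, one such projection would be applied per selected transformer layer l to produce If_i^l.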

Target Network
The target network comprises a text encoder, a text classifier, and a feature projection. We use TextCNN (Kim, 2014) as the text encoder. Because it is lightweight and parallelizable, it has been broadly applied to all types of text-processing systems (Tao et al., 2019; Zhang et al., 2019; Wang et al., 2019; Tian et al., 2020a,b). Given an input sentence x_i, we use TextCNN to extract its information and a max-pooling operation to obtain its vector representation c_i ∈ R^d, where d is the dimension of the hidden vector output by TextCNN. We use c_i and an MLP to construct a text classifier for the downstream text classification task, and denote the output of the MLP as z_i^(S). The structure of the feature projection is the same as that of the inspirer; the difference is that feature maps replace the outputs of the transformer layers: Tg(·) is the feature projection of the target network, and Tf_i^k is the projection representation of the k-th CNN filter.
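The TextCNN encoding step described above (sliding filters of several sizes over the token embeddings, applying a nonlinearity, then max-pooling over time) can be sketched in NumPy. The filter sizes, channel count, and random weights here are illustrative assumptions, not the trained model:

```python
import numpy as np

def textcnn_encode(emb, filters):
    """Minimal TextCNN encoder sketch: for each filter size k, slide a
    window over the token embeddings, apply the filter bank with a ReLU,
    then max-pool over time; concatenating the pooled vectors gives c_i."""
    pooled = []
    for k, W in filters.items():                      # W: (channels, k * emb_dim)
        n = emb.shape[0]
        windows = np.stack([emb[t:t + k].ravel() for t in range(n - k + 1)])
        feature_map = np.maximum(windows @ W.T, 0.0)  # (n - k + 1, channels)
        pooled.append(feature_map.max(axis=0))        # max over time positions
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
emb = rng.normal(size=(32, 300))    # 32 tokens with 300-d embeddings
filters = {k: rng.normal(scale=0.05, size=(200, k * 300)) for k in (2, 3, 5)}
c = textcnn_encode(emb, filters)    # c has 3 x 200 = 600 dimensions
```

The intermediate `feature_map` arrays play the role of the feature maps fed to the target network's projection.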

Two-stage Learning
FLiText consists of two training stages: inspirer pre-training and target-network training. In the first stage, we introduce a variety of advanced semi-supervised ideas to complete the inspirer's training on downstream tasks. In the second stage, FLiText keeps the inspirer's parameters fixed and guides the training of the target network on the downstream tasks via multi-level regular constraints provided by the inspirer network, ultimately achieving efficient semi-supervised distillation learning. By means of this two-stage training, FLiText completes SSL on the lightweight target network.

Figure 2: The architecture of FLiText. EN-SUP is labeled English data; the red and green circles denote different text categories; EN-UNSUP is unlabeled English data and FR-UNSUP is the noised (French back-translated) version of the unlabeled data; FD represents the feature-based distillation loss designed by FLiText; OD represents the output-based distillation loss.

Inspirer Network Training
The training method is inspired by a consistency regularization framework. The loss function consists of two parts: a cross-entropy loss applied to the labeled data and a consistency regularization loss on the unlabeled data. Similar to (Xie et al., 2020), to restrain over-fitting we also use training-signal annealing to balance the participation of labeled data during training. Given unlabeled data u_j and its noised version a_j ∈ X_a, we calculate the inspirer training loss:

L_CE = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log p(y_c | x_i; θ)

L^(T) = L_CE + λ Σ_j KL( p(y | u_j; θ) || p(y | a_j; θ) )

where the superscript u is the unlabeled-data identifier, a is the noise-data identifier, (T) is the inspirer identifier, N is the number of labeled samples, and C is the total number of label categories. p(·) is the predicted probability distribution produced by the model for input x and parameters θ, L_CE is the standard cross-entropy loss applied to labeled samples, and L^(T) is the objective function of the inspirer. Other symbols are the same as before.
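A minimal sketch of this objective, assuming a KL-based consistency term as in UDA and an illustrative weighting factor `lam` (the exact weighting schedule, like the training-signal annealing, is not reproduced here):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def inspirer_loss(logits_x, labels, logits_u, logits_a, lam=1.0):
    """Cross-entropy on labeled data plus a KL consistency term between
    predictions on unlabeled data and their noised (back-translated)
    versions, in the spirit of UDA."""
    p_x = softmax(logits_x)
    ce = -np.mean(np.log(p_x[np.arange(len(labels)), labels]))
    p_u = softmax(logits_u)              # prediction on the clean sample
    p_a = softmax(logits_a)              # prediction on the noised sample
    kl = np.mean(np.sum(p_u * (np.log(p_u) - np.log(p_a)), axis=-1))
    return ce + lam * kl

rng = np.random.default_rng(0)
logits_x = rng.normal(size=(4, 5))
labels = np.array([0, 1, 2, 3])
logits_u = rng.normal(size=(8, 5))
# With identical clean/noised predictions the KL term vanishes;
# perturbing the noised logits can only increase the loss.
loss_same = inspirer_loss(logits_x, labels, logits_u, logits_u)
loss_diff = inspirer_loss(logits_x, labels, logits_u,
                          logits_u + 0.1 * rng.normal(size=(8, 5)))
```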

Target Network Distillation
In FLiText, we use two types of distillation methods together with the consistency regularization framework to complete the guidance of the inspirer network to the target network, by applying a regularized constraint to the objective function of the target network.
Output-based Distillation. Like (Mukherjee and Hassan Awadallah, 2020), we also use hard or soft labels for the output-based KD method:

L_output_distill = -(1/m) Σ_{j=1}^{m} Σ_{c=1}^{C} y^(T)_{j,c} log p(y_c | u_j; θ^(S))

where y^(T)_j is the label predicted by the inspirer and θ^(S) denotes the parameters of the target network.
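The hard-label and soft-label variants can be sketched as follows; the teacher distribution and student logits are toy values for illustration, not model outputs:

```python
import numpy as np

def output_distill_loss(student_logits, teacher_probs, hard=False):
    """Output-based KD sketch: cross-entropy of the student's predictions
    against either the teacher's hard labels (argmax) or its full soft
    probability distribution. Shapes: (batch, num_classes)."""
    s = np.exp(student_logits - student_logits.max(axis=-1, keepdims=True))
    s /= s.sum(axis=-1, keepdims=True)
    if hard:
        idx = teacher_probs.argmax(axis=-1)
        return -np.mean(np.log(s[np.arange(len(idx)), idx]))
    return -np.mean(np.sum(teacher_probs * np.log(s), axis=-1))

teacher = np.array([[0.90, 0.05, 0.05], [0.10, 0.80, 0.10]])
student_logits = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
hard_loss = output_distill_loss(student_logits, teacher, hard=True)
soft_loss = output_distill_loss(student_logits, teacher, hard=False)
```

The soft variant also penalizes the student for missing the teacher's probability mass on non-argmax classes, which is why it is typically larger than the hard variant when the student is confident.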
Feature-based Distillation. Because the output-based KD method does not account for the intermediate learning process, we introduce another KD method: feature-based KD. (Jawahar et al., 2019) shows that BERT captures surface, syntactic, and semantic representations from its low-level to its high-level layers. Inspired by this, and considering that CNN filters of different sizes extract different text features, FLiText assumes that the linguistic level of the features captured by a CNN filter increases with its size. For example, a convolution with a window size of 4 mainly focuses on word-level features, whereas a filter with a window size of 15 can capture semantic-level features. As shown in Figure 3, the proposed hidden-space feature-based distillation scheme can transfer knowledge from BERT to TextCNN. In this scheme, we align the small-size filters with the lower layers of BERT and the large-size ones with the higher layers. This is equivalent to imposing a priori constraints on TextCNN: small filters are required to capture word-level features, medium filters syntactic features, and large ones semantic features. We use the feature projections to match the transformer's hidden states with the feature maps, and complete the knowledge transfer by minimizing the mean squared error between the two feature projections, which is recorded as the feature distillation loss, L_feature_distill.
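A minimal sketch of the alignment-and-MSE step; the mapping from filter size to transformer layer below is an assumed pairing for illustration, not the exact layer assignment reported in the paper:

```python
import numpy as np

def feature_distill_loss(student_proj, teacher_proj, alignment):
    """Feature-based KD sketch: the mean squared error between projected
    CNN feature maps and projected transformer hidden states, averaged
    over the filter-to-layer alignment. `alignment` maps filter size ->
    transformer layer."""
    losses = [np.mean((student_proj[k] - teacher_proj[l]) ** 2)
              for k, l in alignment.items()]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
teacher_proj = {l: rng.normal(size=256) for l in (2, 6, 10)}
alignment = {2: 2, 5: 6, 9: 10}   # small filter -> low layer, large -> high
student_proj = {k: teacher_proj[l].copy() for k, l in alignment.items()}
zero_loss = feature_distill_loss(student_proj, teacher_proj, alignment)
student_proj[2] += 0.1            # perturb one projected feature map
loss = feature_distill_loss(student_proj, teacher_proj, alignment)
```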
Consistency Regularization. Owing to the differences in parameter space and network structure between the target and inspirer networks, some knowledge is lost during the learning process. If only the KD methods were adopted, the target network would not be able to learn some of the functional characteristics of the inspirer network. Therefore, we introduce consistency regularization to constrain the target network and keep it sufficiently smooth in function space. The network should be flat around the input data: even if the input changes slightly, or its form changes while its semantics remain unchanged, the output of the model should remain essentially the same. This is consistent with the training of the inspirer network:

L_consistency = Σ_j KL( p(y | u_j; θ^(S)) || p(y | a_j; θ^(S)) )

Finally, the loss function of the target network is

L^(S) = L_CE^(sup) + α L_output_distill^(unsup) + β L_feature_distill^(unsup) + γ L_consistency^(unsup)

where the superscript (sup) is the labeled-sample identifier, (unsup) is the unlabeled-sample identifier, α, β, and γ are weighting coefficients, and L_CE represents the classification loss calculated on the labeled data.
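Putting the pieces together, the target objective is a weighted sum of the four terms; the weights below are assumed hyperparameters for illustration, not values reported by the paper:

```python
def target_loss(l_ce, l_output, l_feature, l_consistency,
                alpha=1.0, beta=1.0, gamma=1.0):
    """Sketch of the target-network objective: supervised cross-entropy
    on labeled data plus the two distillation terms and the consistency
    term on unlabeled data. alpha/beta/gamma are assumed weights."""
    return l_ce + alpha * l_output + beta * l_feature + gamma * l_consistency

# Toy per-term values just to show the combination.
total = target_loss(0.40, 0.25, 0.10, 0.05)
```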

Dataset
We verify the performance of FLiText on three publicly available English text classification benchmark datasets: IMDb (Maas et al., 2011), Yahoo (Zhang et al., 2015) and Yelp-5 (Xie et al., 2020). From Yahoo and Yelp-5, we randomly sample 70,000 sentences of unlabeled data, and 5,000 sentences as test data to verify the SSL method. We also randomly select 70,000 sentences of labeled data as a full dataset for the supervision method. For all datasets, we use French as an intermediate language for back translation. Table 1 shows the statistical information.

Implementation Details
In all experiments, we set the maximum sentence length to 256 and the dropout rate to 0.5. We use Adam to optimize the parameters of each model. We found that all of the methods, including the proposed work and the methods compared against, achieve their best performance within 10 epochs; to ensure consistent experimental conditions, 10 epochs are used uniformly.

Table 1: Dataset statistics.
Dataset  Classes  Train  Dev    Test
IMDb     2        25000  25000  25000
Yelp-5   5        70000  5000   5000
Yahoo    10       70000  5000   5000

For the inspirer network, we use BERT-base-uncased as the encoder, with a two-layer MLP with 768 hidden states and tanh as the activation function. The learning rate is 2e-5 for the BERT encoder and 1e-3 for the MLP. For the target network, we use the GloVe 300d vectors as the embedding-layer initialization and TextCNN as the encoder, with filter sizes of 2, 3, 5, 7, 9, and 11. The number of output channels is 200, and a max-pooling operation is used to extract key information. For the projection layer, we use a single-layer MLP with a hidden size of 256 and ReLU as the activation function. Most reports on SSL (Xie et al., 2020; Sohn et al., 2020; Berthelot et al., 2019a) only report accuracy or error rate; hence, we also use accuracy for comparison with other works in this paper.
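As a sanity check on the lightweight claim, the parameter count of the convolutional part of this TextCNN configuration (excluding the embedding matrix and classifier) can be computed directly:

```python
# Convolutional parameter count for the TextCNN configuration above:
# 300-d embeddings, filter sizes 2, 3, 5, 7, 9, 11, and 200 output
# channels per size (weights plus biases).
emb_dim, channels = 300, 200
filter_sizes = [2, 3, 5, 7, 9, 11]

conv_params = sum(k * emb_dim * channels + channels for k in filter_sizes)
```

At roughly 2.2M convolutional parameters (plus the embedding table), this sits well below BERT-base's roughly 110M parameters.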

Result
We evaluate FLiText and the baselines under different numbers of labeled examples: 20, 500, and 2500 for IMDb, and 500, 1000, and 2500 for Yelp-5 and Yahoo. The amount of unlabeled data is 70,000 in all cases. The experimental results are shown in Table 2.
Comparison with the supervised learning method. The results of FLiText and TextCNN(fully) in Table 2 show that, with only 500 labeled examples, FLiText greatly exceeds the fully supervised method, by 6.59%, 3.94%, and 3.22% on the three datasets respectively. As the number of labeled examples increases to 2500, the improvement grows to 7.06%, 6.28%, and 6.18%. This shows that FLiText is an effective SSL method for lightweight models.
Comparison with existing SOTA SSL methods on TextCNN. Since the TMix method proposed by MixText cannot be directly applied to TextCNN, we apply the UDA framework to TextCNN. Among the results of FLiText, UDA(TextCNN), and TextCNN, there are two major findings. First, when using 500 labeled examples, FLiText achieves accuracy improvements of 11.8%, 18.26%, and 9.78% over UDA(TextCNN) on the three datasets respectively. Second, even when using 2500 labeled examples (five times as many), the accuracy of UDA(TextCNN) on IMDb and Yelp-5 is still 3.16% and 6% lower than TextCNN(fully) respectively. This indicates that, due to the limited feature extraction capability of the model, directly applying UDA to TextCNN does not work well. Together, these results show that FLiText is a SOTA semi-supervised text classification framework for lightweight models.
Comparison with existing SOTA SSL methods.
In this part, we compare the performance of FLiText, UDA, and MixText on the three datasets. Three conclusions can be drawn from Table 2. First, FLiText performs better on the IMDb and Yelp-5 datasets. For example, with 500 labeled examples, the accuracy of FLiText on Yelp-5 is 1.53% and 3.72% higher than UDA and MixText, respectively. Second, on the IMDb dataset, as the number of labeled examples decreases, FLiText's advantage over the other two methods becomes more pronounced; the same phenomenon can be observed on Yelp-5. This shows that FLiText has a stronger ability to capture text features in scenarios with few labeled examples. Third, because text classification on the Yahoo dataset, with its 10 categories, is relatively difficult, the performance of FLiText is 1% to 2% lower than UDA or MixText under the three different volumes of labeled data. Overall, the performance of FLiText surpasses or approaches that of the SOTA frameworks on the semi-supervised text classification benchmarks, while the model obtained by FLiText is lighter (its parameter scale is only one-thousandth of UDA or MixText) and faster (its inference speed is 67 times faster). As a result, FLiText is a practical framework suitable for many industrial scenarios, especially resource-limited settings and large-scale online systems such as e-commerce search and real-time recommendation.
Comparison with the lightweight BERT. As shown in Table 2, the lightweight BERT (ALBERT) does not perform well under the UDA framework and is worse than FLiText. For example, when using 500 labeled examples, FLiText achieves accuracy improvements of 1.7%, 6.98%, and 2.95% over UDA(ALBERT) on the three datasets respectively. Moreover, even the base version of ALBERT has the same inference speed as BERT, whose inference time is 52 times that of our method, as shown in Table 3.
Comparison with the KD method for BERT.
In this experiment, we also applied the KD methods DistilBERT-6, TinyBERT-4, and TinyBERT-6 to obtain smaller and lighter models from a BERT trained with UDA, for comparison with FLiText. As shown in Table 2, the performance of "UDA+KD(DistilBERT-6)" is worse than FLiText under almost all experimental conditions, with the former's accuracy at least 0.5% lower than the latter's. The same conclusion holds in the comparison with TinyBERT-4, the fastest BERT variant in our experiments, as shown in Table 3. Compared with TinyBERT-6, FLiText performs much better on the IMDb and Yelp-5 datasets. Although the performance of FLiText is about 1% lower than TinyBERT-6 on the Yahoo dataset, it is 32× faster and 7× smaller, which is a valuable trade-off in low-resource situations.
Comparison of the efficiency. From the results in Table 3, FLiText is 11.5× smaller and 67.2× faster than UDA(BERT), and performs as well as UDA(BERT) on the IMDb and Yelp-5 datasets with only 2.7% of the FLOPs. Compared with "UDA+KD(TinyBERT-4)", the smallest BERT variant in Table 2 and Table 3, FLiText is 2× smaller and 6.7× faster, and achieves accuracy improvements of about 1.5% to 3% on the three datasets with 46.2% of the FLOPs. In terms of computational complexity, (Vaswani et al., 2017) shows that multi-head self-attention requires O(n²d + nd²) operations while a 1D-CNN requires O(k·n·d) operations, where n is the sequence length, d is the representation dimension, and k is the convolution kernel size. Therefore, the computational complexity of BERT is O(L·(n²d + nd²)) ≈ O(L·n²·d), where L is the number of Transformer blocks, and the complexity of TextCNN is O(N·k·n·d), where N is the number of CNN kernels. Since N·k ≪ L·n in our setting, the computational complexity of FLiText is much smaller than that of UDA(BERT).
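Plugging illustrative values into the per-layer formulas above makes the gap concrete; n, d, and k here are assumptions (a BERT-like hidden size and a mid-range kernel size from the configuration described earlier), not measured numbers:

```python
# Per-layer operation counts from the complexity analysis above.
# Assumed values: n = sequence length, d = representation dimension
# (BERT-base hidden size), k = a mid-range convolution kernel size.
n, d, k = 256, 768, 6

attention_ops = n**2 * d + n * d**2   # one multi-head self-attention layer
conv_ops = k * n * d                  # one 1D-CNN pass with kernel size k

ratio = attention_ops / conv_ops      # algebraically equals (n + d) / k
```

With these values a single self-attention layer costs over a hundred times more operations than a single convolution pass, before even multiplying by the L Transformer blocks.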

Different Combinations of Transformer Layers and CNN Filters
We choose different transformer layers and filters for multiple combinations and the results are shown in

Consistency Regularization Effect
We add a consistency regularization framework on top of KD to verify its effect. The results are shown in Table 6. We observe that after introducing consistency regularization, the accuracy of TextCNN(KD+CR) increases by 0.22%, 0.66%, and 0.28% on the three datasets respectively, compared to TextCNN(KD).

Table 6: Effect of consistency regularization.
Model           Yelp-5  Yahoo
TextCNN(KD)     57.98   65.81
TextCNN(KD+CR)  58.64   66.09

In our opinion, the improvement brought by consistency regularization is task-independent and can be used as a supplement to KD, guiding the student model to achieve better local smoothness.

Nonlinear Activation Function Effect
We find that adding a nonlinear transformation to the feature projection has a distinct impact on the performance of the model. To verify this, we examine the effect of using ReLU, Tanh, and no nonlinear transformation (∅). The results are shown in Table 7. When the nonlinear transformation is removed, FLiText achieves an accuracy of only 63.37%; using ReLU or Tanh offers a 1.71% or 1.07% boost in performance respectively.

Conclusion
SSL has made great progress, but its rapid development has been accompanied by increasingly complex algorithms and a sharp increase in computation, which is undoubtedly a bottleneck to the practical use of these algorithms in industry. We therefore introduce FLiText, a light and fast SSL framework for text classification with a convolution network. We show that FLiText achieves new SOTA results on multiple benchmark datasets with a lightweight model. Moreover, FLiText achieves performance close to or better than previous SOTA SSL methods, while maintaining a lightweight architecture with only one-thousandth of the parameters and an inference speed-up of more than 50 times. FLiText provides an effective way to deploy semi-supervised algorithms on resource-limited devices and in industrial applications. In future research, we plan to apply FLiText to a wider range of NLP tasks, such as relation extraction and machine translation.