DisCo: Distilled Student Models Co-training for Semi-supervised Text Mining

Many text mining models are constructed by fine-tuning a large deep pre-trained language model (PLM) on downstream tasks. However, a significant challenge is maintaining performance when using a lightweight model with limited labelled samples. We present DisCo, a semi-supervised learning (SSL) framework for fine-tuning a cohort of small student models generated from a large PLM using knowledge distillation. Our key insight is to share complementary knowledge among distilled student cohorts to promote their SSL effectiveness. DisCo employs a novel co-training technique to optimize a cohort of multiple small student models by promoting knowledge sharing among students under diversified views: model views produced by different distillation strategies and data views produced by various input augmentations. We evaluate DisCo on both semi-supervised text classification and extractive summarization tasks. Experimental results show that DisCo can produce student models that are 7.6 times smaller and 4.8 times faster in inference than the baseline PLMs while maintaining comparable performance. We also show that DisCo-generated student models outperform similar-sized models that are elaborately tuned for the individual tasks.


Introduction
Large pre-trained language models (PLMs), such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), play a crucial role in the development of natural language processing applications, where one prominent training regime is to fine-tune the large and expensive PLMs for the downstream tasks of interest (Jiao et al., 2020).
Minimizing model size and accelerating inference are desirable for systems with limited computation resources, such as mobile (Liu et al., 2021) and edge (Tambe et al., 2021) devices. Maintaining the generalization ability of a reduced-size model is therefore crucial, and has been shown to be feasible (Sun et al., 2019; Sanh et al., 2019; Jiao et al., 2020; Wang et al., 2020).
Semi-supervised learning (SSL) emerges as a practical paradigm to improve model generalization by leveraging both limited labelled data and extensive unlabeled data (Rasmus et al., 2015; Lee et al., 2013; Tarvainen and Valpola, 2017; Miyato et al., 2019; Berthelot et al., 2019; Sohn et al., 2020; Fan et al., 2023; Zhang et al., 2021; Berthelot et al., 2022; Zheng et al., 2022; Yang et al., 2023). While promising, combining SSL with a reduced-size model derived from PLMs still necessitates a well-defined learning strategy to achieve improved downstream performance (Wang et al., 2022a). This necessity arises because these shallow networks typically have lower capacity, and the scarcity of labeled data further curtails the model's optimization abilities. Besides, a major hurdle is the lack of labelled samples, a particular problem for text mining tasks because labelling text is labour-intensive and error-prone (Gururangan et al., 2019; Chen et al., 2020; Xie et al., 2020; Lee et al., 2021; Xu et al., 2022; Zhao et al., 2023).
This paper thus targets using SSL to leverage distilled PLMs in a situation where only limited labelled data is available and fast model inference is needed on resource-constrained devices. To this end, we use the well-established teacher-student knowledge distillation technique to construct small student models from a teacher PLM and then fine-tune them on the downstream SSL tasks. We aim to improve the effectiveness of fine-tuning small student models for text-mining tasks with limited labelled samples.
We present DisCo, a novel co-training approach aimed at enhancing SSL performance using distilled small models and limited labelled data. The student models in DisCo acquire complementary information from multiple views, thereby improving generalization ability despite the small model size and limited labelled samples. We introduce two types of view diversity for co-training: (i) model view diversity, which leverages diversified initializations for the student models in the cohort; (ii) data view diversity, which incorporates varied noisy samples for the student models in the cohort. Specifically, model view diversity is generated by different task-agnostic knowledge distillations from the teacher model, while data view diversity is achieved through various embedding-based data augmentations of the input instances.
Intuitively, DisCo with the model view encourages the student models to learn from each other interactively and maintain reciprocal collaboration. The student cohort with the model views increases each participating model's posterior entropy (Chaudhari et al., 2017; Pereyra et al., 2017; Zhang et al., 2018), helping them to converge to a flatter minimum with better generalization. At the same time, DisCo with the data views regularizes student predictions to be invariant when noise is applied to input examples. Doing so improves the models' robustness on diverse noisy samples generated from the same instance. This, in turn, helps the models obtain missing inductive biases on learning behaviour, i.e., adding more inductive biases to the models can lessen their variance (Xie et al., 2020; Lovering et al., 2021).
We have implemented a working prototype of DisCo and applied it to text classification and extractive summarization tasks. We show that by co-training just two student models, DisCo can deliver faster inference while maintaining the performance level of the large PLM. Specifically, DisCo can produce a student model that is 7.6× smaller (4-layer TinyBERT) with 4.8× faster inference time while achieving superior ROUGE performance in extractive summarization over the source teacher model (12-layer BERT). It also achieves better or comparable text classification performance compared to previous state-of-the-art (SOTA) SSL methods with 12-layer BERT, while maintaining a lightweight architecture with only a 6-layer TinyBERT. We also show that DisCo substantially outperforms other SSL baselines, delivering higher accuracy when using student models of the same size.

Overview of DisCo
DisCo jointly trains distilled student cohorts to improve model effectiveness in a complementary way from diversified views. As a working example, we explain how to use a dual-student DisCo to train two kinds of student models (see Figure 1). Extension to more students is straightforward (see section 2.3). To this end, DisCo introduces two initialization views during the co-training process: (i) model views, which are different student model variants distilled from the teacher model, and (ii) data views, which are different data-augmented instances produced from the training input.
In DisCo, two kinds of compressed students (represented by two different colours in Figure 1(a)) are generated by the same teacher. This process allows us to pre-encode the model view specifically for DisCo. Additionally, we duplicate copies of a single student model to receive supervised and unsupervised data individually. In the supervised learning phase, DisCo optimizes the two students using labelled samples. In the unsupervised learning phase, each student model concurrently shares its parameters with its corresponding duplicate, which is trained by supervised learning. The subsequent consistency training loss then optimizes the students using unlabeled samples.
For an ablation comparison of DisCo, we introduce a variant of DisCo equipped only with the model view, shown in Figure 1 (b). In this variant, labelled and unlabeled data are duplicated and fed to the students directly. DisCo and its variant ensure reciprocal collaboration among the distilled students and can enhance the generalization ability of the student cohort through the consistency constraint. In this section, we introduce DisCo from two aspects: knowledge distillation and the co-training strategy.

Student Model Generation
Our current implementation uses knowledge distillation to generate small-sized models from a PLM. Like the task-agnostic distillation of TinyBERT (Jiao et al., 2020), we use the original BERT without fine-tuning as the teacher model to generate the student models (in most cases, at least two student models are generated in our implementation). The task-agnostic distillation method is convenient because it can use any teacher network directly.
We use WikiText-103, a large-scale general-domain corpus released by Merity et al. (2017), as the training data for the distillation. The student mimics the teacher's behaviour through representation distillation from the BERT layers: (i) the output of the embedding layer, (ii) the hidden states, and (iii) the attention matrices.
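A minimal sketch of this three-part representation distillation objective, assuming NumPy arrays and a plain MSE over each matched pair. The helper names and the optional projection matrix `W` (used when student and teacher hidden sizes differ) are illustrative, not the authors' released implementation:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two equally shaped arrays."""
    return float(np.mean((a - b) ** 2))

def distill_loss(t_emb, s_emb, t_hiddens, s_hiddens, t_attns, s_attns, W=None):
    """Task-agnostic distillation loss over (i) the embedding-layer output,
    (ii) the (layer-mapped) hidden states, and (iii) the attention matrices.
    t_hiddens/s_hiddens and t_attns/s_attns are lists of matched layers;
    W optionally projects student states into the teacher's hidden size."""
    proj = (lambda h: h @ W) if W is not None else (lambda h: h)
    loss = mse(proj(s_emb), t_emb)
    loss += sum(mse(proj(sh), th) for sh, th in zip(s_hiddens, t_hiddens))
    loss += sum(mse(sa, ta) for sa, ta in zip(s_attns, t_attns))
    return loss
```

With identical student and teacher representations the loss is zero; any mismatch in embeddings, hidden states, or attention maps increases it.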

Model View Encoding
To ensure the grouped students present different views of the teacher, we distil different BERT layers from the same teacher. Model view encoding diversifies the individual students by leveraging different knowledge of the teacher. We propose two strategies for the knowledge distillation process: (i) Separated-layer KD (SKD): the student learns from alternate layers of the teacher; for instance, {3, 6, 9, 12} are 4 alternate layers of BERT. (ii) Connected-layer KD (CKD): the student learns from contiguous layers of the teacher; for example, {1, 2, 3, 4} are 4 contiguous layers of BERT. In the case of dual-student DisCo, the two students built with the two knowledge distillation strategies are represented as S AK and S BK . The co-training framework encourages the distinct individual models to teach each other in a complementary manner under the model view initialization.
With consistency constraints, our co-training framework can obtain valid inductive biases from the model views, enabling student peers to teach each other and generalize to unseen data. Apart from the model views, we also introduce data views, produced by various data augmentations of the inputs, to expand the inductive biases.
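The two layer-selection strategies above can be sketched as follows (function names are hypothetical; this is an illustrative sketch, not the authors' released code):

```python
def skd_layers(teacher_layers: int, student_layers: int):
    """Separated-layer KD (SKD): pick alternate teacher layers, e.g. a
    12-layer teacher and a 4-layer student -> [3, 6, 9, 12]."""
    step = teacher_layers // student_layers
    return [step * (i + 1) for i in range(student_layers)]

def ckd_layers(student_layers: int, start: int = 1):
    """Connected-layer KD (CKD): pick contiguous teacher layers, e.g. a
    4-layer student starting at layer 1 -> [1, 2, 3, 4]."""
    return [start + i for i in range(student_layers)]
```

For the 2-layer students discussed later, CKD with start=1 gives {1, 2} and start=11 gives {11, 12}.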

Data View Encoding
We use different data augmentation strategies at the token embedding layer to create different data views from the input samples. Our intuition is that advanced data augmentation can introduce extra inductive biases since it is based on random sampling at the token embedding layer with minimal semantic impact (Xie et al., 2020; Wu et al., 2020; Yan et al., 2021; Gao et al., 2021). Inspired by ConSERT (Yan et al., 2021) and SimCSE (Gao et al., 2021), we adopt convenient data augmentation methods: adversarial attack (Kurakin et al., 2017), token shuffling (Lee et al., 2020), cutoff (Shen et al., 2020) and dropout (Hinton et al., 2012), described as follows. Adversarial Attack (AD). We implement it with Smoothness-Inducing Adversarial Regularization (SIAR) (Jiang et al., 2020), which encourages the model's output not to change much when a small perturbation is injected into the input. Token Shuffling (TS). This strategy is similar to Lee et al. (2020) and Yan et al. (2021); we implement it by passing shuffled position IDs to the embedding layer while keeping the order of the token IDs unchanged. Cutoff (CO). This method randomly erases some tokens in the embedding matrix for token cutoff. Dropout (DO). As in BERT, this scheme randomly drops elements with a specific probability and sets their values to zero.
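As a sketch of how the TS, CO, and DO augmentations might operate at the embedding level (helper names and array conventions are our assumptions; token embeddings are rows of a [seq_len, dim] matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

def token_shuffle(position_ids):
    """TS: shuffle the position IDs fed to the embedding layer while the
    token IDs themselves stay in their original order."""
    return rng.permutation(position_ids)

def token_cutoff(emb, ratio=0.2):
    """CO: erase a random subset of token rows in the embedding matrix."""
    out = emb.copy()
    n = out.shape[0]
    drop = rng.choice(n, size=int(n * ratio), replace=False)
    out[drop] = 0.0
    return out

def dropout(emb, p=0.1):
    """DO: zero each element independently with probability p (the
    inverted 1/(1-p) rescaling is omitted here for brevity)."""
    mask = rng.random(emb.shape) >= p
    return emb * mask
```

Each augmentation preserves the tensor shape, so the student forward pass is unchanged; only the view of the input differs.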
DisCo incorporates two forms of data view during co-training: a HARD FORM and a SOFT FORM. Taking dual-student networks as an example, we use two different data augmentation approaches, such as AD and DO, to implement the HARD FORM data view. For the SOFT FORM data view, we apply the same data augmentation approach, e.g., AD, with two rounds of random initialization to ensure distinct views. In DisCo, each student obtains perturbation differences through various combinations of the HARD FORM and SOFT FORM.

Co-training Framework
Formally, we are provided with a semi-supervised dataset D = S ∪ U. S = {(x̂, ŷ)} is the labelled data, where (x̂, ŷ) is used identically for the two kinds of students. U = {x*} is the unlabeled data, of which two copies are made, one for each kind of student. For X ∈ D, let ϕ A (X) and ϕ B (X) denote the two data views of X. A pair of models (S AK = f A and S BK = f B ) are the two distilled student models, which we treat as the model view of dual-student DisCo. Student f A only uses ϕ A (X), and student f B uses ϕ B (X).
By training collaboratively within the cohort, students f A and f B share complementary information through the co-training optimization objective, which improves the generalization ability of each network. Supervised Student Cohort Optimization. For the supervised part, we use the categorical Cross-Entropy (CE) loss to optimize student f A and student f B , respectively. They are trained with the labeled data (x̂, ŷ) sampled from S:

L s = CE(f A (ϕ A (x̂)), ŷ) + CE(f B (ϕ B (x̂)), ŷ)    (1)

Unsupervised Student Cohort Optimization. In standard co-training, multiple classifiers are expected to provide consistent predictions on the unlabeled data x* ∈ U.
The consistency cost on the unlabeled data x* is computed from the two students' output logits z A (ϕ A (x*)) and z B (ϕ B (x*)). We use the Mean Square Error (MSE) to encourage the two students to predict similarly:

L u = MSE(z A (ϕ A (x*)), z B (ϕ B (x*)))    (2)

Overall Training Objective. Finally, we combine the supervised cross-entropy loss with the unsupervised consistency loss and train the model by minimizing the joint loss:

L = L s + λ · μ(t, n) · L u    (3)

where μ(t, n) = min(t/n, 1) is the ramp-up weight, starting from zero and increasing linearly during the initial n training steps, and λ is the hyperparameter balancing supervised and unsupervised learning.
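The joint objective with the linear ramp-up weight μ(t, n) can be sketched as follows (helper names are hypothetical; logits and labels are illustrative NumPy values, not the authors' implementation):

```python
import numpy as np

def rampup(t: int, n: int) -> float:
    """mu(t, n) = min(t/n, 1): linear ramp-up over the first n steps."""
    return min(t / n, 1.0)

def cross_entropy(logits, label):
    """Categorical CE from raw logits for a single example."""
    z = logits - logits.max()                  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

def joint_loss(sup_logits_a, sup_logits_b, label, z_a, z_b, t, n, lam=1.0):
    """L = L_CE(A) + L_CE(B) + lam * mu(t, n) * MSE(z_A, z_B)."""
    l_sup = cross_entropy(sup_logits_a, label) + cross_entropy(sup_logits_b, label)
    l_con = float(np.mean((z_a - z_b) ** 2))   # consistency on unlabeled logits
    return l_sup + lam * rampup(t, n) * l_con
```

When the two students agree on the unlabeled example, the consistency term vanishes and only the supervised loss remains; early in training the ramp-up also keeps the consistency term small.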

Co-training of Multi-student Peers
So far, our discussion has been focused on training two students. DisCo can be naturally extended to support not just two but more student networks in the cohort. Given K networks Θ 1 , Θ 2 , ..., Θ K (K ≥ 2), the objective for optimizing each Θ k (1 ≤ k ≤ K) becomes:

L Θk = L CE (Θ k ) + λ · μ(t, n) · (1/(K−1)) Σ l≠k MSE(z k (ϕ k (x*)), z l (ϕ l (x*)))    (4)

The dual-student objective of Equation (3) is now a particular case of Equation (4) with K = 2. With more than two networks in the cohort, each student in DisCo takes the ensemble of the other K−1 student peers to provide mimicry targets; that is, each student learns from every other student in the cohort individually.

We also compare with semi-supervised UDA (Xie et al., 2020) and FLiText (Liu et al., 2021), as well as other prominent SSL text classification methods, for which we report results on the Unified SSL Benchmark (USB) (Wang et al., 2022a). Most of these SSL methods work well on computer vision (CV) tasks, and Wang et al. (2022a) generalize them to NLP tasks by integrating a 12-layer BERT. More detailed introductions are given in Appendix A.4. For extractive summarization tasks, we compare: (i) the supervised baseline BERTSUM (Liu and Lapata, 2019), (ii) two SOTA semi-supervised extractive summarization methods, UDASUM and CPSUM (Wang et al., 2022b), and (iii) three unsupervised techniques, LEAD-3, TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004). We use the open-source releases of the competing baselines.
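One plausible reading of the multi-student consistency term, with each student matching its K−1 peers, is sketched below (the averaging over peers is our assumption for illustration, not necessarily the authors' exact formulation):

```python
import numpy as np

def peer_consistency_terms(logits):
    """Given K student logit vectors on the same unlabeled example,
    return each student's consistency term: the MSE to each of its
    K-1 peers, averaged over those peers."""
    K = len(logits)
    terms = []
    for k in range(K):
        peer_mse = [float(np.mean((logits[k] - logits[l]) ** 2))
                    for l in range(K) if l != k]
        terms.append(sum(peer_mse) / (K - 1))
    return terms
```

With K = 2 this reduces to the symmetric dual-student MSE; with identical predictions all terms are zero.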
Experimental Results

Evaluation on Text Classification
As shown in Table 2, the two students produced by DisCo with a 6-layer distilled BERT (S A6 and S B6 ) consistently outperform TinyBERT and UDA TinyBERT in all text classification tasks. Moreover, one student of our dual-student 6-layer DisCo
Compared with FLiText, DisCo improves the average classification accuracy by 1.9% while using a student model with 0.7M fewer parameters than FLiText. FLiText relies heavily on back-translation models for generating augmented data, similar to UDA. Unfortunately, this strategy fails to eliminate the error propagation introduced by the back-translation model and requires additional data pre-processing. Besides, FLiText consists of two training stages and needs supervised optimization in both stages, increasing training costs and requiring additional supervision.
Table 3 shows the results of comparing DisCo to other prominent SSL methods that are integrated with a 12-layer BERT. We take the results for these baselines from the source publications or the Unified SSL Benchmark (USB) (Wang et al., 2022a). However, most of them perform worse than DisCo's students with only a 6-layer BERT using the same labeled data. In the case of Yahoo!Answers text classification, our 6-layer BERT-

Evaluation on Extractive Summarization
For the semi-supervised extractive summarization tasks, our dual-student DisCo outperforms all baselines in Table 4. Despite using a smaller, 4-layer model, DisCo performs better than the 12-layer BERTSUM, UDA, and CPSUM. The results show that our method can reduce the cost of supervision in extractive summarization tasks. Further ROUGE results with 10 or 1000 labeled examples are presented in Appendix A.5.

Model Efficiency
As shown in Table 5, compared with the teacher BERT BASE , all 4-layer student models deliver faster inference, speeding it up by 4.80×-7.52× across the two tasks. FLiText is slightly faster than the smaller model generated by DisCo. This is because FLiText uses a convolutional network while our student models use BERT with multi-head self-attention, which has higher computational complexity than convolutional networks. However, despite having more parameters, FLiText gives worse performance (about 3.04% accuracy deficit on average), as shown in Table 2.

Effect of using Multi-student Peers
Having examined the dual-student DisCo in prior experiments, we next explore the scalability of DisCo by introducing more students into the cohort. As shown in Table 6, the performance of every single student improves when extending the DisCo cohort to four students, which demonstrates that the generalization ability of the students is enhanced when they learn together with an increasing number of peers.
Besides, the results in Table 6 validate the necessity of co-training with multiple students. A greater number of student peers (multi-students) in the co-training process yields a considerable performance enhancement compared to a smaller student group (dual-students).

Effect of using Multi-View Strategy
As shown in Table 8, DisCo composed of student networks distilled from the teacher is clearly superior to DisCo composed of two randomly initialized student networks, which verifies the advantage of our model view settings. In DisCo, the SOFT FORM and HARD FORM data views built from combinations of DO and AD bring the best effect. Other data views, built from combinations of TS and CO, yield sub-optimal effects, as presented in Appendix A.5. Under the same model view, DisCo integrated with the SOFT FORM data view is slightly better than the one using the HARD FORM data view. These observations indicate that adversarial perturbations are more useful for dual-student DisCo. Modelling the invariance of internal noise in the sentences can thus improve the model's robustness.
Further, we plot the training loss contours of DisCo and its ablation model in Figure 2. Both models have a fairly benign landscape dominated by a region with convex contours in the centre and no dramatic non-convexity. We observe that the optima obtained by training with both the model view and the data view are flatter than those obtained with only a model view. A flat landscape implies that small perturbations of the model parameters cannot seriously hurt the final performance, while a chaotic landscape is more sensitive to subtle changes (Li et al., 2018).
Table 8: The impact of incorporating multi-view encoding for the dual-student DisCo. The HARD data view is created using dropout (DO) and adversarial attack (AD). The SOFT view employs adversarial attack (AD) with varying initialization. The model view ( ) refers to students trained from scratch without any teacher knowledge.

In the preceding analysis detailed in Table 2, UDA/FLiText utilized back-translation as their data augmentation strategy, a technique distinctly different from the token-embedding-level data augmentation employed in our DisCo framework. To ensure a balanced comparison, we substituted the back-translation approach with our AD augmentation method for UDA/FLiText. The outcomes of this modification are portrayed in Table 9. These results underscore that, regardless of the data augmentation strategy implemented, the performance of both UDA and FLiText falls short of our DisCo framework. This substantiates our claim that our co-training framework is superior at distilling the knowledge encapsulated in unsupervised data. Furthermore, the performance across most tasks declines after the augmentation technique alteration. As stipulated in (Xie et al., 2020), the UDA/FLiText framework necessitates that augmented data maintain 'similar semantic meanings', making back-translation a more suitable choice for UDA/FLiText than the AD augmentation we incorporated.

Conclusion
In this paper, we present DisCo, a framework for co-training distilled students with limited labelled data, targeting lightweight models for semi-supervised text mining. DisCo leverages model views and data views to improve model effectiveness. We evaluate DisCo by applying it to text classification and extractive summarization tasks and comparing it with a diverse set of baselines. Experimental results show that DisCo achieves substantially better performance across scenarios using lightweight SSL models.

Limitations
Naturally, there is room for further work and improvement, and we discuss a few points here. In this paper, we apply DisCo to BERT-based student models created from the BERT-based teacher model. It would be useful to evaluate whether our approach can generalize to other model architectures like TextCNN (Kim, 2014) and MLP-Mixer (Tolstikhin et al., 2021). It would also be interesting to extend our work to utilize the inherent knowledge of other language models (e.g., RoBERTa (Liu et al., 2019), GPT (Radford et al., 2018; Radford et al.; Brown et al., 2020), T5 (Raffel et al., 2020)).
Another limitation of our framework settings is the uniform number of BERT layers in all distilled student models. To address this, students in DisCo could be enhanced by introducing architectural diversity, such as varying the number of layers. Previous studies (Mirzadeh et al., 2020; Son et al., 2021) have demonstrated that a larger-sized student, acting as an assistant network, can effectively simulate the teacher and narrow the gap between student and teacher. We acknowledge these limitations and plan to address them in future work.

Ethical Statement
The authors declare that they have no conflicts of interest. Informed consent is obtained from all individual participants involved in the study. This article does not contain any studies involving human participants performed by any authors.

A Appendix
A.1 Background and Related Work
Knowledge Distillation (KD). KD (Hinton et al., 2015) is one of the promising ways to transfer knowledge from a powerful large network or ensemble to a small network that meets low-memory or fast-execution requirements. BANs (Furlanello et al., 2018) sequentially distill the teacher model into multiple generations of student models with identical architecture to achieve better performance. BERT-PKD (Sun et al., 2019) distills patiently from multiple intermediate layers of the teacher model at the fine-tuning stage. DistilBERT (Sanh et al., 2019) and MiniLM (Wang et al., 2020) leverage knowledge distillation during the pre-training stage. TinyBERT (Jiao et al., 2020) sets up a two-stage knowledge distillation procedure that contains general-domain and task-specific distillation for the Transformer (Vaswani et al., 2017). Despite their success, these methods may deliver sub-optimal performance in language understanding tasks due to the trade-off between model compression and performance loss.

Co-Training.
It is a classic semi-supervised learning paradigm that trains two (or more) deep neural networks on complementary views (i.e., data views from different sources that describe the same instances) (Blum and Mitchell, 1998). By minimizing the error on limited labelled examples and maximizing the agreement on sufficient unlabeled examples, the co-training framework finally obtains two accurate classifiers, one for each view, in a semi-supervised manner (Qiao et al., 2018).

A.2 Hyperparameters
The BERT BASE , as the teacher model, has a total of 109M parameters (number of layers N = 12, hidden size d = 768, feed-forward size d′ = 3072 and head number h = 12). We use the BERT tokenizer to tokenize the text. The maximum source-text length is 512 for extractive summarization and 256 for text classification. For extractive summarization, we select the top 3 sentences according to the average length of the Oracle human-written summaries. We use the default dropout settings in our distilled BERT architecture. The ratio of token cutoff is set to 0.2, as suggested in (Yan et al., 2021; Shen et al., 2020). The dropout ratio is set to 0.1. The Adam optimizer with β1 = 0.9, β2 = 0.999 is used for fine-tuning. We set the learning rate to 1e-4 for extractive summarization and 5e-3 for text classification, with learning rate warm-up over 20% of the total steps. The λ balancing supervised and unsupervised learning is set to 1 in all our experiments. The supervised batch size is set to 4, and the unsupervised batch size is 32 for the summarization task (16 for the classification task).
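The hyperparameters above can be collected into a single configuration sketch (the dictionary structure and key names are our own, hypothetical organization; the values are taken from the description above):

```python
# Hypothetical config layout; values follow the hyperparameter description.
CONFIG = {
    "teacher": {"layers": 12, "hidden": 768, "ffn": 3072, "heads": 12},  # 109M params
    "max_len": {"summarization": 512, "classification": 256},
    "cutoff_ratio": 0.2,
    "dropout": 0.1,
    "adam_betas": (0.9, 0.999),
    "lr": {"summarization": 1e-4, "classification": 5e-3},
    "warmup_frac": 0.2,          # warm-up over 20% of total steps
    "lambda": 1.0,               # supervised/unsupervised balance
    "batch": {"sup": 4, "unsup_summarization": 32, "unsup_classification": 16},
}
```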

A.4 Baselines Details
For the text classification task, TinyBERT (Jiao et al., 2020) is a compressed model implemented with a 6-layer or 4-layer BERT BASE . For semi-supervised methods, we use the released code to train UDA, which includes a ready-made 12-layer BERT BASE , 6-layer, or 4-layer TinyBERT. FLiText (Liu et al., 2021) is a lightweight and fast semi-supervised learning framework for the text classification task, consisting of two training stages.

With very limited labeled data (e.g., 10 per class), KL divergence may surpass MSE in performance. This can be attributed to the noisy predictions produced by the student model, as its performance is not optimal because of the limited labeled data. KL divergence enforces label matching, thereby reducing issues resulting from corrupted knowledge transferred from another student model (Kim et al., 2021).
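The MSE-versus-KL trade-off discussed above can be illustrated with a small sketch of both consistency losses (helper names are hypothetical; DisCo's default is MSE on raw logits, while the KL variant compares softmax distributions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # shift for numerical stability
    return e / e.sum()

def mse_consistency(z_a, z_b):
    """MSE on raw logits, as used in DisCo's consistency loss."""
    return float(np.mean((z_a - z_b) ** 2))

def kl_consistency(z_a, z_b):
    """KL(p_a || p_b) on softmax outputs; enforces label matching more
    strictly, which can help when a peer's predictions are noisy."""
    p, q = softmax(z_a), softmax(z_b)
    return float(np.sum(p * np.log(p / q)))
```

Both losses vanish when the students agree; they differ in how strongly they penalize disagreements in the predicted label distribution.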

A.10 Details in Loss Landscape Visualization
Our loss visualization approach adheres to the 'filter normalization' method (Li et al., 2018). For each setting, we select the top-performing student checkpoint based on its validation-set results. Subsequently, we generate two random direction vectors and normalize them using parameters specific to each model. Finally, using the same training data and augmentation techniques, we plot the training loss landscape along the two normalized directions.
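The per-model normalization step can be sketched as follows (a simplified, filter-wise version of the method in Li et al., 2018; the function name and the row-as-filter convention are our assumptions):

```python
import numpy as np

def filter_normalize(direction, weights):
    """Filter normalization: rescale each row ('filter') of a random
    direction to match the norm of the corresponding weight filter, so
    loss landscape plots are comparable across differently scaled models."""
    d = np.array(direction, dtype=float, copy=True)
    for i in range(d.shape[0]):
        scale = np.linalg.norm(weights[i]) / (np.linalg.norm(d[i]) + 1e-10)
        d[i] *= scale
    return d
```

After normalization, each direction row has exactly the norm of the matching weight row, which removes scale artifacts from the 2D loss surface.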

Figure 1 :
Figure 1: The training architecture of DisCo (a) and the ablation variant (b). One marker denotes 'DO USE' and the other 'DO NOT USE'. L s is the supervised loss and L u the unsupervised loss. 'KD' is an abbreviation for knowledge distillation.
Figure 2: 2D visualization of the loss surface contours of DisCo (w/ model view and w/ data view) and its ablation variant (w/ model view). Subfigures (a) and (b) are the text classification task on the Agnews dataset with 10 labeled data per class. Subfigures (c) and (d) are the extractive summarization task with 100 labeled data.

Table 1 :
Dataset statistics and dataset split of the semi-supervised extractive summarization dataset and several typical semi-supervised text classification datasets, in which '×' means the number of data per class.

Table 2 :
Text classification performance (Acc (%)) on typical semi-supervised text classification tasks. P is the number of model parameters. The best results are in bold.

Implementation Details. The main experimental results presented in this paper come from the best model view and data view found among multiple combinations of view encoding strategies. Taking dual-student DisCo as an example, we present the results of S AK and S BK , with the model view being a combination of SKD (alternate K-layer) and CKD (continuous K-layer). The data view is the SOFT FORM of two different AD initializations.
We evaluate DisCo on extractive summarization and text classification tasks, as shown in Table 1. For extractive summarization, we use the CNN/DailyMail (Hermann et al., 2015) dataset, training the model with 10/100/1000 labeled examples. For text classification, we evaluate on semi-supervised datasets: Agnews (Zhang et al., 2015) for news topic classification, Yahoo!Answers (Chang et al., 2008) for Q&A topic classification, and DBpedia (Mendes et al., 2012) for Wikipedia topic classification. The models are trained with 10/30/200 labeled data per class and 5000 unlabeled data per class. Further details on the evaluation methodology are in Appendix A.3. DisCo (S A6 and S B6 ) use the data view of AD. DisCo (S A4 and S B4 ) use similar combinations to DisCo (S A6 and S B6 ). DisCo (S A2 ) uses CKD with BERT layers {1, 2} and DisCo (S B2 ) uses CKD with BERT layers {11, 12}. The details of DisCo's hyperparameters are presented in Appendix A.2. We run each experiment with three random seeds and report the mean performance on test data; the experiments are conducted on a single NVIDIA Tesla V100 32GB GPU. Competing Baselines. For text classification tasks, we compare DisCo with: (i) supervised baselines, BERT BASE and default TinyBERT (Jiao et al., 2020).

Table 3 :
Text classification performance (Acc (%)) of other prominent SSL text classification models; all results are reported by the Unified SSL Benchmark (USB) (Wang et al., 2022a). D refers to datasets, L m is the number of BERT layers used by the models, and L d is the labeled data per class.

Table 4 :
ROUGE F1 performance of the extractive summarization. L d = 100 refers to the labeled data per class. SSL baselines (CPSUM and UDASUM) use the same unlabeled data as DisCo.

Table 5 :
Model efficiency: model size and inference speedup on a single NVIDIA Tesla V100 32GB GPU. T TS (ms) refers to the speedup of extractive summarization models trained with 100 labeled data. T TC (ms) refers to the speedup of text classification models trained with Agnews 200 labeled data per class.

Table 6 :
Text classification performance (Acc (%)) of DisCo with multiple student peers. The students (S A2 , S B2 , S C2 , S D2 ) are distilled from layers {1, 2}, {3, 4}, {9, 10} and {11, 12} of the teacher BERT BASE , respectively. The first four students adopt HARD FORM data views, which are AD, DO, TS, and CO, respectively. The last four students adopt a SOFT FORM data view with different DO initializations. Results better than the dual-student DisCo in Table 2 are in bold.

Table 7 :
Performance comparison between DisCo and a single student model with AD augmentation. 'SingleStudent' is the better-performing of the two students within the DisCo framework.

Table 13 :
Comparison between MSE loss and KL divergence in 4-layer DisCo.