Progressive Self-Training with Discriminator for Aspect Term Extraction

Aspect term extraction aims to extract the aspect terms in a review sentence on which users have expressed opinions. One of the remaining challenges for aspect term extraction is the lack of sufficient annotated data. While self-training is potentially an effective method to address this issue, the pseudo-labels it yields on unlabeled data can introduce noise. In this paper, we use two means to alleviate the noise in the pseudo-labels. First, inspired by curriculum learning, we refine conventional self-training into progressive self-training. Specifically, the base model infers pseudo-labels on a progressive subset at each iteration, where the samples in the subset become harder and more numerous as the iterations proceed. Second, we use a discriminator to filter the noisy pseudo-labels. Experimental results on four SemEval datasets show that our model significantly outperforms the previous baselines and achieves state-of-the-art performance.


Introduction
Aspect term extraction (ATE) is a crucial task in aspect-level sentiment analysis that aims to extract all aspect terms present in a sentence (Pontiki et al., 2014). For example, given the restaurant review "I looove their eggplant pizza, as well as their pastas!", an ATE system should extract "eggplant pizza" and "pastas".
Many researchers formulate ATE as a sequence labeling problem or a token-level classification problem. The current state-of-the-art neural models can be classified into two categories. The first designs sophisticated models with a variety of techniques, such as history attention (Li et al., 2018), sequence-to-sequence learning (Ma et al., 2019), and constituency lattice (Yang et al., 2020). Although these models achieve satisfactory performance, they depend on the availability of sufficient training data. However, labeling a large amount of aspect data may not be practical due to its cost.
* Corresponding author.
The second category addresses the data insufficiency issue from a different perspective. For example, Li et al. (2020) generated reviews while preserving the original aspects by formulating data augmentation as a conditional generation task. However, varying only a small number of consecutive non-aspect words in the reviews limits the semantic diversity of the samples. Chen and Qian (2020) tackled the long-tail distribution problem of aspect terms and context words in the training sets with soft prototypes. Nevertheless, because the soft prototypes use external data only implicitly, the usefulness of that data is discounted.
Different from previous approaches, in this paper we use self-training (Scudder, 1965) to alleviate the insufficiency of labeled data. In self-training, a base model trained on the labeled data is used to infer pseudo-labels on the unlabeled data, and then a new base model is trained to optimize the loss on human labels and pseudo-labels jointly. We can iterate this algorithm a few times by using the new base model to relabel the unlabeled data and then retraining a new base model. Hence, we can supplement the labeled data with some pseudo-labeled data.
However, approaches relying on self-training typically suffer from noise induced by pseudo-labels. In this paper, we use two means to mitigate the negative effects of pseudo-labels. The first is to refine conventional self-training into progressive self-training. Here, we use a progressive subset at each iteration instead of the entire unlabeled dataset. During the iterative process, the unlabeled samples in the subset become harder and more numerous. Our motivation stems from curriculum learning (Bengio et al., 2009): we expect to infer pseudo-labels for unlabeled data in order from easy to hard and from few to many. In this process, easy unlabeled data brings in little noise, and the models learned in earlier iterations are better and thus generate less noise at later stages.
The second is to use a discriminator to filter out as much noise as possible from the pseudo-labels. Inspired by Mao et al. (2021), we construct a question sequence from a sentence and a pseudo-labeled aspect term in that sentence, and feed it to the discriminator to make a true-false prediction. To train the discriminator, we fabricate positive and negative samples from the training set. Here, the negative samples are constructed based on left and right boundary errors of aspect terms and on errors of non-aspect terms that share a POS tag with aspect terms. A pseudo-labeled sentence is used to train the new base model only if all of its aspect terms are judged true; otherwise, it is filtered out. Moreover, we can also apply self-training to the discriminator itself to enhance its discriminatory power.
Our method follows multiple steps.
Step 1: train a base model on the labeled data.
Step 2: divide the unlabeled data into progressive subsets for curriculum learning based on difficulty and quantity.
Step 3: synthesize positive and negative samples on the train set to train a discriminator.
Step 4: use the base model to infer the pseudo-labels of the samples in the current unlabeled subset, and then filter the noisy pseudo-labels with the discriminator.
Step 5: retrain a new base model (discriminator) using the labeled data and the filtered pseudo-labeled data.
Steps 4 and 5 are repeated until the progressive subsets are exhausted (i.e., the curriculum learning is completed).
Overall, we make the following contributions: (a) To the best of our knowledge, we are the first to use self-training to address the problem of insufficient labeled data in ATE; (b) To mitigate the noise introduced by self-training, we refine the general self-training to progressive self-training and bring in a discriminator to filter the noisy pseudo-labels; (c) Experimental results on four ATE datasets show that our method outperforms the baselines and achieves the state-of-the-art performance. Furthermore, we conduct extensive experiments to verify its effectiveness and generalization.

Related Work
Aspect Term Extraction
Earlier research focused on exploiting pre-defined rules (Hu and Liu, 2004; Wu et al., 2009), hand-crafted features (Liu et al., 2012), or prior knowledge (Chen et al., 2014) to solve ATE. More recently, researchers have applied deep learning and parsing techniques to ATE, such as LSTM (Liu et al., 2015), CNN (Xu et al., 2018), attention (Li et al., 2018), BERT (Xu et al., 2019), and constituency parsing (Yang et al., 2020). A recent trend is towards a unified framework (Mao et al., 2021). So far, one of the remaining challenges for ATE is the insufficiency of annotated data, especially as neural models become larger and more complex. To address this issue, Li et al. (2020) presented a conditional data augmentation approach for ATE. In addition, to solve the data sparsity problem, Chen and Qian (2020) introduced soft prototypes trained on internal or external data. In this paper, we focus on the scenario of insufficient labeled data and alleviate it via self-training on unlabeled data.
Self-training
Self-training, proposed by Scudder (1965), is a semi-supervised approach that leverages unlabeled data to create better models. It first trains a base model on a small amount of labeled data, then uses that model to pseudo-label unlabeled data and augments the labeled data with the pseudo-labeled data, and finally retrains the model iteratively. Recently, it has yielded state-of-the-art performance on machine learning tasks such as image classification (Zoph et al., 2020), few-shot text classification (Mukherjee and Awadallah, 2020), and neural machine translation (He et al., 2019). Error propagation (Wang et al., 2021) from noisy pseudo-labels is an obvious problem in self-training. In this paper, we alleviate the noise in the pseudo-labels by using progressive subsets (i.e., curriculum learning) and a discriminator.
Curriculum Learning
Learning from easier samples first and harder samples later is a common strategy in curriculum learning (Bengio et al., 2009). Our progressive self-training method focuses on easier samples in the early stage and uses hard samples in the later stage. Our aim is to reduce the noise in the pseudo-labels: the pseudo-labels for easy examples are less prone to errors, and models learned in earlier iterations can yield more accurate pseudo-labels at later stages.

Problem Formulation
Figure 1: An illustration of our method. The "filtered pseudo-labeled data" are unavailable before the iteration starts. In the progressive subsets, the samples become harder and more numerous from one subset to the next, as required for curriculum learning.

Given a token sequence $x = \{x_1, x_2, \ldots, x_n\}$ of length $n$, the ATE task can be characterized as a token-level classification problem. The ATE model takes $x$ as input and outputs a label sequence $y = \{y_1, y_2, \ldots, y_n\}$, where $y_i \in \{B, I, O\}$ indicates whether the corresponding token is at the beginning, inside, or outside of an aspect term. Given a labeled dataset $D = \{(x_i, y_i)\}_i$ and an unlabeled dataset $D_u = \{x_i\}_i$, our method aims to yield a competitive ATE model.
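To make the B/I/O labeling concrete, the following minimal Python sketch decodes a label sequence back into aspect terms for the review used in the Introduction. The helper name `bio_to_spans` and the lenient handling of an "I" label with no preceding "B" are illustrative assumptions, not part of the original method.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, one per aspect term."""
    spans = []
    start = None  # index where the currently open aspect term began
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:      # close the previous term first
                spans.append((start, i))
            start = i
        elif label == "I":
            if start is None:          # assumption: a stray "I" opens a term
                start = i
        else:                          # "O" closes any open term
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # term running to the end of the sentence
        spans.append((start, len(labels)))
    return spans


tokens = "I looove their eggplant pizza , as well as their pastas !".split()
labels = ["O", "O", "O", "B", "I", "O", "O", "O", "O", "O", "B", "O"]
terms = [" ".join(tokens[s:e]) for s, e in bio_to_spans(labels)]
print(terms)  # ['eggplant pizza', 'pastas']
```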

Overview
Figure 1 provides an illustration of our method. We first train a base model with the standard cross-entropy loss on the labeled dataset. We then use the base model to estimate the difficulty of the unlabeled samples, which lets us divide them into progressive subsets for curriculum learning, where both the difficulty and the number of samples increase from one subset to the next. Meanwhile, we synthesize training data for the discriminator and train it. We then use the base model to infer pseudo-labels on the current unlabeled subset. Intuitively, easy unlabeled data is less prone to noise, and models learned in earlier iterations are better and thus generate less noise at later stages. In addition, to reduce noise as much as possible, we apply the discriminator to filter noisy pseudo-labels; the filtered pseudo-labeled data can in turn be used to train a better discriminator. We then train a new base model by pre-training on the filtered pseudo-labeled data and fine-tuning on the labeled data. Finally, we iterate this process by using the new base model to infer pseudo-labels on the next unlabeled subset.

Aspect Term Extraction Model
We formulate ATE as a token-level classification task, where for each token $x_i$ in the sentence, our ATE model assigns a label $y_i$. Our ATE model uses BiLSTM (Hochreiter and Schmidhuber, 1997) or BERT (Devlin et al., 2019) as the encoder. The encoder takes a sequence of tokens as input and produces a sequence of contextual hidden states. To obtain the logits, we attach a linear layer to the end of the encoder. During the training phase, the encoder and the linear layer are trained by minimizing the cross-entropy loss:
$$\mathcal{L}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{CE}(f_{ATE}(x_i, \theta_{ATE}), y_i) \qquad (1)$$
where CE is the cross-entropy loss function, $f_{ATE}$ denotes our ATE model parameterized by $\theta_{ATE}$, and $n$ is the length of the token sequence.
In the inference phase, the ATE model predicts the sequence of labels with the following equation:
$$\hat{y}_i = \arg\max_{k \in \{B, I, O\}} f_{ATE}(x_i, \theta_{ATE})_k \qquad (2)$$
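The training loss and the inference rule can be illustrated with a small, dependency-free Python sketch operating on per-token logit vectors over the labels B, I, and O. This is not the authors' implementation: it assumes the per-token cross-entropy terms are averaged over the sequence, and the function names and toy logits are hypothetical.

```python
import math
from typing import List

LABELS = ["B", "I", "O"]


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one token's logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_loss(logits: List[List[float]], gold: List[str]) -> float:
    """Cross-entropy averaged over the n tokens (assumed normalization)."""
    total = 0.0
    for token_logits, label in zip(logits, gold):
        total -= log_softmax(token_logits)[LABELS.index(label)]
    return total / len(gold)


def predict(logits: List[List[float]]) -> List[str]:
    """Pick the label with the largest logit for every token."""
    return [LABELS[max(range(len(g)), key=g.__getitem__)] for g in logits]


toy_logits = [[0.1, 0.2, 3.0], [2.5, 0.3, 0.1], [0.2, 2.0, 0.4]]
print(predict(toy_logits))                         # ['O', 'B', 'I']
print(sequence_loss(toy_logits, ["O", "B", "I"]))  # small, since the gold labels match
```

With uniform logits the loss of a single token equals log 3, which is a quick sanity check for the three-way classification.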

Progressive Subsets
Conventional self-training performs inference on all unlabeled data, which inevitably introduces much noise. Inspired by curriculum learning (Bengio et al., 2009), we refine conventional self-training into progressive self-training. We assume that in the early stages, easy unlabeled samples are not prone to induce noise, and that in the late stages, the learned model has improved and will generate less noise on hard unlabeled samples. To divide the unlabeled data into progressive subsets, we define the difficulty of a sample based on the average logit of its tokens. We consider that the larger the logit, the more information it contains and the more confident the model's prediction is, and hence the easier the unlabeled sample.
$$\mathrm{degree}(x) = \frac{1}{n} \sum_{i=1}^{n} g_i[\hat{y}_i] \qquad (3)$$
where $g_i \in \mathbb{R}^3$ is the logit vector of the token $x_i$, and $g_i[\hat{y}_i]$ is the logit value corresponding to the prediction $\hat{y}_i$. The degree indicates the difficulty of the sample: the larger the value, the easier the sample. In addition, we find that keeping the subset size increasing from one iteration to the next benefits performance.
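The scoring and partitioning step can be sketched as follows. Since the predicted label is the arg-max of the logit vector, its logit is simply the maximum entry. The sketch assumes that the subsets are disjoint, consecutive slices of the easiest-first ranking; the function names and the toy scores are hypothetical.

```python
from typing import List, Sequence


def easiness(logits: Sequence[Sequence[float]]) -> float:
    """Average logit of the predicted label over all tokens of one sample.

    Larger values mean the base model is more confident, i.e. the sample
    is treated as easier.
    """
    return sum(max(token_logits) for token_logits in logits) / len(logits)


def progressive_subsets(scores: List[float], sizes: List[int]) -> List[List[int]]:
    """Split sample indices into subsets ordered from easy to hard.

    `sizes` gives the number of samples per subset, e.g. [1000, 2000, 3000,
    4000] for T = 4 with an increment of 1k. Assumption: subsets do not overlap.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    subsets, start = [], 0
    for size in sizes:
        subsets.append(order[start:start + size])
        start += size
    return subsets


toy_scores = [0.5, 2.0, 1.0, 3.0, 0.1, 1.5]
print(progressive_subsets(toy_scores, sizes=[1, 2, 3]))
# [[3], [1, 5], [2, 0, 4]]
```

Under this reading, sizes of 1k, 2k, 3k, and 4k add up to exactly the 10k unlabeled samples used in the experiments.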

Discriminator
Intuitively, filtering out all the noise in the pseudo-labels accurately and automatically is not realistic. We can only filter out as much noise as possible, and to this end we introduce a discriminator. It makes a true-false determination for each inferred aspect term based on the corresponding context. Subsequently, we decide whether a sample is suitable for retraining the base model based on the discrimination results for all aspect terms in the sample.
Inspired by Mao et al. (2021), we formulate this identification task as a question answering problem: for each sentence, we ask in turn whether each inferred aspect term is true, and we expect an affirmative or negative response. To derive a suitable input, we pack the sentence and a custom question into one input sequence. The input sequence is obtained as follows: a [CLS] token is added at the beginning of the token sequence, and a [SEP] token is inserted at the end of the sentence and at the end of the custom question. For instance, we can derive the following input sequence from the review above:
[CLS] I looove their eggplant pizza , as well as their pastas ! [SEP] Is " pastas " an aspect term in the sentence ? [SEP]
For simplicity, the encoder of the discriminator is identical to that of the ATE model. Here, the final hidden state corresponding to the [CLS] token is used as the aggregate sequence representation and fed into the classifier. Suppose the dataset is $D_d = \{(x, a, y)\}$, where $a$ is an aspect term in $x$ and $y \in \{0, 1\}$ is the label of $a$. We optimize the discriminator with the following loss:
$$\mathcal{L}(x, a, y) = \mathrm{BCE}(f_{dis}(x, a, \theta_{dis}), y) \qquad (4)$$
where BCE is the binary cross-entropy loss function and $f_{dis}$ is a discriminator parameterized by $\theta_{dis}$. Subsequently, the trained discriminator makes a true-false determination for each inferred aspect term $\tilde{a}$ to filter the noisy pseudo-labels:
$$\hat{y} = \mathrm{INT}(\mathrm{sigmoid}(f_{dis}(x, \tilde{a}, \theta_{dis})) \geq 0.5) \qquad (5)$$
where INT maps true and false to 1 and 0, respectively.
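The input construction and the filtering rule can be illustrated with a short Python sketch. The discriminator itself is replaced by an arbitrary scoring function, so `scorer`, `build_input`, and `keep_sentence` are hypothetical names; the sketch also assumes that a sentence with no inferred aspect terms is kept, which the text does not specify.

```python
import math
from typing import Callable, List


def build_input(tokens: List[str], aspect: str) -> List[str]:
    """Pack the sentence and the custom question into one input sequence."""
    question = ["Is", '"'] + aspect.split() + ['"'] + "an aspect term in the sentence ?".split()
    return ["[CLS]"] + tokens + ["[SEP]"] + question + ["[SEP]"]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def is_true_aspect(score: float) -> int:
    """Decision rule of Eq. 5: 1 if the sigmoid of the raw score is >= 0.5, else 0."""
    return int(sigmoid(score) >= 0.5)


def keep_sentence(tokens: List[str], aspects: List[str],
                  scorer: Callable[[List[str]], float]) -> bool:
    """Keep a pseudo-labeled sentence only if every inferred aspect term is judged true.

    Assumption: a sentence without inferred aspect terms passes (all() of nothing).
    """
    return all(is_true_aspect(scorer(build_input(tokens, a))) == 1 for a in aspects)


tokens = "I looove their eggplant pizza , as well as their pastas !".split()
print(" ".join(build_input(tokens, "pastas")))
```

Any callable that maps the packed token sequence to a raw score can stand in for `scorer`, for example a fine-tuned encoder with a classification head on the [CLS] state.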
However, from the ATE dataset we can only obtain the positive samples in $D_d$, not the negative samples. We observe that wrong aspect terms tend to be boundary errors and non-aspect term errors. Inspired by this observation, we synthesize negative samples based on left and right boundary errors and on errors of non-aspect terms that share a POS tag with aspect terms (see footnote 1). Table 1 gives examples of wrong aspect terms.
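One possible reading of the three error rules is sketched below. The exact construction in Table 1 is not reproduced here, so the details are assumptions: boundary errors are produced by moving one boundary by a single token, and non-aspect errors are single tokens outside any gold term whose POS tag also occurs inside a gold term. POS tags are passed in as plain strings (hand-written in the example) rather than computed with a tagger.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def boundary_negatives(span: Span, n: int) -> Tuple[List[Span], List[Span]]:
    """Shift the left or the right boundary of a gold span by one token."""
    s, e = span
    left = [(s2, e) for s2 in (s - 1, s + 1) if 0 <= s2 < e]
    right = [(s, e2) for e2 in (e - 1, e + 1) if s < e2 <= n]
    return left, right


def pos_negatives(pos_tags: List[str], gold: List[Span]) -> List[Span]:
    """Single non-aspect tokens whose POS tag also appears inside a gold term."""
    covered = {i for s, e in gold for i in range(s, e)}
    aspect_tags = {pos_tags[i] for i in covered}
    return [(i, i + 1) for i, tag in enumerate(pos_tags)
            if i not in covered and tag in aspect_tags]


def synthesize_negatives(pos_tags: List[str], gold: List[Span]) -> List[Span]:
    """Collect negatives from all three rules, dropping duplicates and gold spans."""
    candidates: List[Span] = []
    for span in gold:
        left, right = boundary_negatives(span, len(pos_tags))
        candidates.extend(left + right)
    candidates.extend(pos_negatives(pos_tags, gold))
    seen, result = set(gold), []
    for span in candidates:
        if span not in seen:
            seen.add(span)
            result.append(span)
    return result


# Toy sentence with hand-written tags and a hypothetical gold term "staff".
toy_tokens = "the staff at the pizza place was friendly".split()
toy_tags = ["DT", "NN", "IN", "DT", "NN", "NN", "VBD", "JJ"]
negatives = synthesize_negatives(toy_tags, gold=[(1, 2)])
print([" ".join(toy_tokens[s:e]) for s, e in negatives])
# ['the staff', 'staff at', 'pizza', 'place']
```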

Training
We first train a base model on the labeled data and use the average logit from the base model to partition the unlabeled data into progressive subsets. Second, we train a discriminator on the synthetic dataset. Third, we infer pseudo-labels on the current unlabeled subset and filter the noisy aspect terms with the discriminator. We then train a new base model and a new discriminator using both the labeled data and the filtered pseudo-labeled data. Finally, we apply the new base model to the next unlabeled subset. The full procedure is presented in Algorithm 1.
Footnote 1: We use NLTK to derive the POS tag of each token.
Algorithm 1: Progressive self-training with discriminator
...
use $f_{ATE}$ to infer pseudo-labels on $D_{u/i}$ via Eq. 2
use $f_{dis}$ to filter the noise in the pseudo-labels via Eq. 5, and obtain $\tilde{D}_{u/i} \subseteq D_{u/i}$
...
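The control flow of the training procedure described in this section can be sketched in Python with the models abstracted away as callables. This is a hedged outline rather than the authors' algorithm: all parameter names are hypothetical, and it assumes that filtered pseudo-labeled data accumulates across iterations and that the discriminator is retrained on samples synthesized from that accumulated pool.

```python
from typing import Any, Callable, List, Sequence, Tuple

Example = Tuple[List[str], List[str]]  # (tokens, B/I/O labels)


def progressive_self_training(
    labeled: List[Example],
    subsets: Sequence[Sequence[List[str]]],
    train_base: Callable[..., Any],
    train_disc: Callable[[List[Example]], Any],
    synth_disc_data: Callable[[List[Example]], List[Example]],
    infer: Callable[[Any, List[str]], List[str]],
    keep: Callable[[Any, List[str], List[str]], bool],
) -> Tuple[Any, Any]:
    base = train_base(pseudo=[], labeled=labeled)   # base model on labeled data only
    disc_data = synth_disc_data(labeled)            # synthetic discriminator samples
    disc = train_disc(disc_data)
    pool: List[Example] = []                        # filtered pseudo-labeled data so far
    for subset in subsets:                          # easy and small first, hard and large last
        pseudo = [(x, infer(base, x)) for x in subset]
        pool.extend((x, y) for x, y in pseudo if keep(disc, x, y))
        # Pre-train on pseudo-labels, then fine-tune on labels (inside train_base).
        base = train_base(pseudo=pool, labeled=labeled)
        disc = train_disc(disc_data + synth_disc_data(pool))
    return base, disc


# Toy stand-ins so the loop can be executed end to end.
def toy_train_base(pseudo, labeled):
    return {"n_pseudo": len(pseudo), "n_labeled": len(labeled)}

def toy_train_disc(data):
    return {"n": len(data)}

toy_labeled = [(["great", "pizza"], ["O", "B"])]
toy_subsets = [[["good", "battery"]], [["noise"], ["nice", "screen"]]]
base, disc = progressive_self_training(
    toy_labeled, toy_subsets, toy_train_base, toy_train_disc,
    synth_disc_data=lambda examples: list(examples),
    infer=lambda model, x: ["O"] * len(x),
    keep=lambda model, x, y: "noise" not in x,
)
print(base, disc)  # {'n_pseudo': 2, 'n_labeled': 1} {'n': 3}
```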

Datasets
We conduct experiments on four datasets from SemEval 2014 Task 4 (Pontiki et al., 2014), SemEval 2015 Task 12 (Pontiki et al., 2015), and SemEval 2016 Task 5 (Pontiki et al., 2016). Statistics of the datasets are presented in Table 2. In addition, following Xu et al. (2018), we randomly hold out 150 examples from the train set as the validation set for tuning hyper-parameters. We employ the F1 metric to evaluate the performance of the models.
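For reference, a common way to compute F1 for ATE is exact-match scoring over aspect term spans, micro-averaged over sentences. The text only states that F1 is used, so the matching criterion in the sketch below is an assumption, and the function name is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def span_f1(gold: List[List[Span]], pred: List[List[Span]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 with exact span matching."""
    tp = sum(len(set(g) & set(p)) for g, p in zip(gold, pred))
    n_gold = sum(len(set(g)) for g in gold)
    n_pred = sum(len(set(p)) for p in pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# One sentence: "eggplant pizza" is found, "pastas" is predicted with a wrong left boundary.
gold_spans = [[(3, 5), (10, 11)]]
pred_spans = [[(3, 5), (9, 11)]]
print(span_f1(gold_spans, pred_spans))  # (0.5, 0.5, 0.5)
```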
We select the first 2,754 and 6,754 samples from the Amazon Cell Phones and Accessories dataset (He and McAuley, 2016; see footnote 2) and the Yelp Review dataset (Zhang et al., 2015; see footnote 3), respectively. The former is treated as unlabeled data in the laptop domain, while the latter is treated as unlabeled data in the restaurant domain. After these samples are preprocessed (see footnote 4), we obtain 10k unlabeled samples.

Implementation Details
We choose two representative encoders (BiLSTM and BERT) as the backbone to implement our method (see footnote 5). For the BiLSTM encoder, the word embeddings are initialized with GloVe-840B-300d (Pennington et al., 2014). The hidden size is set to 300, and we use Adam (Kingma and Ba, 2014) with a learning rate of 1e-4 to optimize the parameters. For the BERT encoder, we use BERT base with 12 attention heads, 12 hidden layers, and a hidden size of 768, resulting in 110M pre-trained parameters. During fine-tuning, we employ AdamW (Loshchilov and Hutter, 2018) to optimize the parameters. The learning rates are 3e-5 and 3e-4 for the pre-trained parameters and the added parameters, respectively. In addition, we set the batch size to 48 and the dropout rate to 0.1. For the progressive subsets $\{D_{u/i}\}_{i=1}^{T}$, we set $T$ to 4, and we set each subset size to $|D_{u/i}| = i \times 1\text{k}$. We run all experiments on a single Tesla V100S GPU.

Baselines
To evaluate the effectiveness of our method, we compare it with four groups of baselines. The first group consists of the SemEval winners: IHS-RD (Chernyshevich, 2014), DLIREC (Toh and Wang, 2014), EliXa (San Vicente et al., 2015), and NLANGP (Toh and Su, 2016) are the winners for the Lap14, Res14, Res15, and Res16 datasets, respectively. The second group generally employs neural networks with complex structures to solve ATE, such as MIN (Li and Lam, 2017), HAST (Li et al., 2018), Seq2Seq4ATE (Ma et al., 2019), DECNN (Xu et al., 2018), and CLATE (Yang et al., 2020). The third group aims to tackle the problem of insufficient annotated data, such as conditional data augmentation (CDA) (Li et al., 2020) and soft prototypes trained on external data (SoftProtoE) (Chen and Qian, 2020). The last group consists of our customized models for clear comparison. BiLSTM(BERT, BERT-PT)-TC uses BiLSTM (pre-trained BERT, post-trained BERT-PT (Xu et al., 2019)) with a linear layer for token classification. BERT-RC (Mao et al., 2021) treats ATE as a reading comprehension task.
Other results are the average scores of three runs with random initialization. + denotes the method combined with the benchmark model; † indicates that the score is significantly better than that of the customized baseline at significance level p < 0.01. The scores of the best baselines are italicized, and the best scores are in bold.
Footnote 2: https://jmcauley.ucsd.edu/data/amazon
Footnote 3: https://www.yelp.com/dataset
Footnote 4: Preprocessing mainly consists of dividing clauses based on symbols (e.g., periods, question marks, and exclamation points) and word completions (e.g., replacing cant with can't).
Footnote 5: Our code is available at: https://github.com/qlwang25/progressive_self_training

Main Results
The main experimental results on the four datasets are reported in Table 3. We can draw the following conclusions from the table. First, our method substantially enhances our custom baselines. For example, although BERT-TC achieves competitive performance among the baselines, our method further achieves 3.85%, 3.18%, 3.57%, and 3.05% absolute gains on the four datasets. Second, compared to BiLSTM, the baselines based on pre-trained models improve more when combined with our method. We attribute this to pre-trained models being better able to tolerate the noise in the pseudo-labels. This is also consistent with the finding of Du et al. (2020) that combining pre-training and self-training can further improve performance. Third, BERT-PT exceeds most existing ATE models by a large margin, confirming the power of domain-specific post-training. Surprisingly, BERT-PT can be further improved significantly (2.68%, 2.43%, 1.97%, 4.23%) and reaches a new state of the art when combined with our method. Finally, our method is clearly more effective than SoftProtoE and CDA in alleviating the shortage of labeled data, and it is also notable that we use less unlabeled data (2,754 vs. 100,000).
Table 4: Ablation studies (F1 scores) on the components of our method. ST: conventional self-training; ST&Dis: conventional self-training with the discriminator (remove line 2 and run the for loop several times until convergence); PST: progressive self-training without the discriminator (remove lines 3, 4, 7, 9, and 11, and set $\tilde{D}_{u/i} = D_{u/i}$). Our method is equivalent to the combination of PST and Dis.

Ablation Studies
Compared to conventional self-training, our method differs in two aspects: the unlabeled samples are divided into progressive subsets for curriculum learning based on the average logit of tokens, and a discriminator is used to filter as much noise as possible from the pseudo-labels. To verify the validity of these two points, we create three variants for ablation studies. As shown in Table 4, all variants exceed the baseline, suggesting that the use of unlabeled data is helpful even with a strong pre-trained language model. In addition, a modest gain (1.48%, 1.39%, 1.92%, 0.43%) over plain self-training is observed when self-training is combined with the discriminator, which shows that the discriminator improves the quality of the pseudo-labeled data, and thus the model performance, by reducing noise. Among the three variants, progressive self-training is the best overall, suggesting that the quality of pseudo-labels can be effectively improved through curriculum learning. Combining progressive self-training with the discriminator further improves performance, showing that the two complement each other in improving the quality of pseudo-labeled data.
Table 5: Ablation studies (accuracy and F1 scores) on the error rules. E1, E2, and E3 denote left-boundary errors, right-boundary errors, and non-aspect term errors with the same POS tag, respectively. The best scores are in bold and the second-best scores are in italics. Note that we apply all three error rules to synthesize negative samples in the test set.
To train the discriminator, we synthesize negative samples from the labeled data under three error rules (Table 1). To verify the effectiveness of each rule, we conduct the corresponding ablation studies. As shown in Table 5, the discriminator achieves substantial gains with the combination of E1 and E2. This indicates that negative samples of the boundary error type play an essential role in training the discriminator. Additionally, adding E3 improves the performance a bit more, showing that the non-aspect term error type is useful and reasonable.

Discussion
Performance on Different Amounts of Labeled Data
To investigate the performance of our method when labeled data is scarce, we intentionally control the number of reviews in the labeled data and run evaluations with the new training set. As shown in Figure 2, our method significantly improves the scores compared to using only a small amount of labeled data (4.41% vs. 54.39% on the Lap14 dataset, 57.03% vs. 70.4% on the Res14 dataset). Moreover, our method substantially outperforms the conventional self-training method. In particular, our method shows a strong advantage when the proportion of original labeled data is less than 10% (39.1% vs. 54.39% on the Lap14 dataset, 40.92% vs. 60.11% on the Res14 dataset).

Effect of Progressive Subsets on Performance
Inspired by curriculum learning (Bengio et al., 2009), in this paper we refine self-training into progressive self-training. We expect that in the early stages, easy unlabeled data induces less noise, while in the later stages, the model has improved through learning and generates less noise on hard unlabeled data. To this end, we divide the unlabeled data into progressive subsets in order of increasing difficulty and quantity. To examine our motivation, we conduct comparative experiments. As shown in Table 6, using harder and more unlabeled data in the early stages hurts performance. The underlying reason may be that it introduces too much noise, which makes it difficult for the model to learn. This indirectly supports the soundness and effectiveness of the progressive subsets.
Effect of Retraining Way on Performance
In our algorithm, we first pre-train the model on pseudo-labeled data and then fine-tune it on labeled data (line 10). Here, we compare with an alternative that trains the model on labeled data and pseudo-labeled data jointly. From Table 7, we find that the combination of pre-training and fine-tuning slightly exceeds joint training (84.19% vs. 83.41%). We observe that pre-training only on the pseudo-labeled data leads to lower F1 scores than training only on the labeled data (75.37% vs. 80.32%), suggesting that the distribution of the unlabeled data differs from that of the labeled data. In this case, pre-training first and then fine-tuning can mitigate the effect of the different data distributions.
Table 6: Comparison (F1 scores) of progressive subsets with different settings. →: the order in which the unlabeled data is used. The difficulty of the unlabeled data is evaluated by the average logit of tokens. The quantity denotes the size of the subset; $|D_{u/i}| = i \times 1\text{k}$, $|D_{u/i}| = 2.5\text{k}$, and $|D_{u/i}| = (5 - i) \times 1\text{k}$ correspond to its three settings, respectively.

Effect of Pre-trained Models of Different Power on Performance
As can be seen from Table 3, the combination of self-training and pre-trained models appears to be especially beneficial. For further exploration and validation, we conduct comparative experiments using pre-trained models of different power as the backbone. In Table 8, we observe a significant increase in improvement from BERT mini to BERT base, but the improvement seems to saturate when going from BERT base to BERT large. Therefore, we conclude that the self-training method can yield larger gains when combined with a more capable pre-trained model, but the gains do not always increase as the power of the pre-trained model increases.

Effect of Unlabeled Data Size on Performance
We conduct experiments to understand the impact of using different amounts of unlabeled data. We start with no unlabeled data and then gradually increase the amount. As shown in Figure 3, the performance increases significantly until the amount of unlabeled data reaches 10k, and then increases slowly. Thus, we conclude that using a large amount of unlabeled data helps improve performance, but the improvement gradually slows down.

Case Study
We present the predictions of the models on three random examples in Table 9. We can see that our method indeed corrects the predictions of the baseline. In addition, we find that over-correction (e.g., staff person→staff) and under-correction (e.g., pie company→pie) problems occur with the conventional self-training method, which we attribute to the introduction of too much noise. Cases on pseudo-labeled data are available in Table 14 in the Appendix.
Error Analysis
We examine the log files and classify the error predictions into three categories (under-prediction, over-prediction, and boundary errors).
We show examples of each category in a table. From these files, we find that the two methods yield some similar errors, suggesting that hard samples are indeed difficult to predict. In addition, we observe a higher percentage of over-prediction than of the other two types, which may be the underlying reason for recall being higher than precision.

Conclusion
In this paper, we focus on the problem of insufficient labeled data in ATE and try to solve it via self-training. To mitigate the noise in pseudo-labels, we make two efforts: (i) motivated by curriculum learning, we refine conventional self-training into progressive self-training, expecting to reduce the generation of noisy pseudo-labels; (ii) we introduce a discriminator to filter the noisy pseudo-labels. Experimental results show that our method beats the baselines and achieves state-of-the-art performance. Moreover, we verify its effectiveness and generalization through extensive experiments.

A Appendix
Effect of the Number of Progressive Subsets on Performance
In the above experiments, we split the unlabeled data into four progressive subsets (i.e., T = 4). A question that may arise is whether the number of progressive subsets has a significant impact on the performance of the method.
To probe this question, we divide the unlabeled data into different numbers of progressive subsets for comparison. The results are shown in Table 11.

Effect of Different Incremental Magnitude on Performance
In this paper, we set the incremental magnitude to 1k for simplicity, i.e., $|D_{u/i}| = i \times 1\text{k}$. We assume that a larger incremental magnitude should have a positive impact on the model performance, since the progressive subsets stop increasing in size once the magnitude drops to zero. The experimental scores in Table 12 validate our assumption. Overall, subsets with a larger incremental magnitude boost the model performance more than those with a smaller incremental magnitude.

Fine-grained Named Entity Recognition Experiments
To demonstrate that our method can be applied to other sequence labeling tasks, we experiment on the fine-grained named entity recognition task (Xu et al., 2020). We treat the first 1k samples in the original training set as labeled data and the rest of the samples (9,747) as unlabeled data. Table 13 shows that our method also has an advantage over conventional self-training (71.94% vs. 69.64%) on the fine-grained named entity recognition task, which supports the generalizability of our method.
Table 13: Comparison (F1 scores) on the fine-grained named entity recognition task (validation set). Here, bert refers to Chinese BERT base; * indicates the use of all annotated data; ST denotes the conventional self-training method. The best scores are in bold and the second-best scores are in italics.
Comparison of the Parameter Amount and the Computational Complexity
The significant time cost of our method is mainly attributed to the fact that both the ATE model and the discriminator need to be retrained after each subset is used (lines 10 and 11 of Algorithm 1). For clarity of exposition, we conduct the relevant experiments on the Res15 dataset with 10k unlabeled samples. The parameter counts of our method and the conventional self-training (ST) method are 218M and 109M, respectively. The main reason for this large difference is that our method includes a discriminator to filter out noise in the pseudo-labels. In addition, training our method and the ST method requires 56 min and 17 min, respectively (both use the same hyper-parameters).
We can see that our model takes several times as long to train as the ST method because the base model must be retrained several times. However, it is worth noting that both take the same time during the inference phase, because our method involves only the ATE model in practical inference. For example, both our method and the ST method take 4 s to run inference on the Res15 test set (685 samples).

Table 14: Colored phrases indicate the pseudo-labeled aspect terms; green and red (by manual inspection) indicate correct and incorrect pseudo-labels, respectively.

(a) Examples of pseudo-labeled data of the conventional self-training method.
Columns: Unlabeled Data, Sentence
D_u:
  they look good and stick good !
  i just do n ' t like the rounded shape because i was always bumping it and siri kept popping up and it was irritating .
  these stickers work like the review says they do .
  however , i ordered these buttons because they were a great deal and included a free screen protector .
  especially having nails , it helps to have an elevated key .
  these make using the home button easy .
  people ask where i got them from it ' s great when driving .
  battery charges with full battery lasts me a full day .
  easy access to all buttons and features , without any loss of phone reception .
  it is a genuine blackberry charger .

(b) Examples of pseudo-labeled data of our method. The last column is the results of the discriminator.
Columns: Unlabeled Data Subset, Sentence, Discriminate
D_{u/1}:
  the igo bluetooth keyboard works great .
  good headset , good sound , great price .
  good headset , good sound , great price .
  good headset , good sound , great price .
D_{u/2}:
  no issues at all with this battery order .
  it has loud speakers and eliminates background noises .
  you can turn the ear piece off then power on to answer in order to keep your fav ring tone .
  this thing works good , but its not all fireworks and hotel parties .
D_{u/3}:
  also i have n ' t had any complaints from other friends i ' ve talked to with the headset .
  i have owned 2 of these and its the best bluetooth i have used .
  there is a difference in usb cables .
  it does nothing to improve your signal .
D_{u/4}:
  battery lasts a couple of weeks without recharging .
  it was comfortable and transmission was good .
  i never leave home without my ipad and this most useful stylus .
  this is a nice charger but you can tell it was made cheaply in china .