Self-Adapter at SemEval-2021 Task 10: Entropy-based Pseudo-Labeler for Source-free Domain Adaptation

Source-free domain adaptation is an emerging line of work in deep learning research since it is closely related to the real-world environment. We study domain adaptation for the sequence labeling problem in the setting where a model trained on the source domain data is given. We propose two methods: Self-Adapter and Selective Classifier Training. Self-Adapter is a training method that uses sentence-level pseudo-labels filtered by a self-entropy threshold to supervise the whole model. Selective Classifier Training uses token-level pseudo-labels and supervises only the classification layer of the model. The proposed methods are evaluated on data provided by SemEval-2021 Task 10, and Self-Adapter ranks 2nd.


Introduction
Domain adaptation (DA) is the task of applying an algorithm trained on source domain data to different target domain data with limited or undefined labels. DA has received significant attention as an alternative to the fine-tuning approach (Ganin and Lempitsky, 2015; Saito et al., 2018; Tzeng et al., 2017), especially in situations where rich supervision is not possible (Morerio et al., 2018). DA is an important way of overcoming the data shortage of deep learning since it enables the use of knowledge from other labeled data.
In the general setting of DA, the data distributions in the source domain and the target domain are related but different (Storkey and Sugiyama, 2007), and annotated samples from the source domain are available during the training process. However, many data resources cannot be shared in real-life environments because of growing privacy concerns. For example, Twitter has a regulation that prevents sharing tweet text. The policy is even more rigorous in the financial and clinical domains, where privacy protection is at stake. Source-free DA has been proposed to cope with such restrictions on data sharing.
Unlike conventional DA, source-free DA gives no access to the source domain data; only a model trained on the source domain data is provided. Several approaches to source-free DA have been proposed in computer vision: Sahoo et al. (2020) assume that the target domain data is a transformation of the source domain data along natural axes such as brightness and contrast; Kundu et al. (2020) propose universal DA that is trained via two-stage learning of procurement and deployment; and Kim et al. (2020) progressively update the target model with pseudo-labels that are selected under a self-entropy criterion.
As for natural language processing (NLP), the application of source-free DA is slightly more complicated since sentences are usually considered as having discrete representations. In this context, SemEval-2021 task 10 has proposed a challenge that is related to source-free domain adaptation for semantic processing.
In this paper, we propose Self-Adapter for the time expression recognition sub-task in SemEval-2021 Task 10. Following Kim et al. (2020), we employ pseudo-labels from the target domain to further supervise the model trained on the source domain data, and we adopt an entropy-based evaluation of reliable pseudo-labels in consideration of the discrete text data. In addition, we adopt a trick we call Sloughing to prevent over-fitting.
To demonstrate the efficacy of Self-Adapter, we evaluate the proposed method on the dataset of Laparra et al. (2018). We also compare the proposed method with several variations and with another method we devised, named Selective Classifier Training (SCT). In the end, Self-Adapter ranked 2nd in the official evaluation period with an F1 of 0.811, which is 1.7 percentage points higher than the RoBERTa-based sequence tagging model pre-trained only on source data.

Figure 1: Training pipeline of Self-Adapter. At the beginning of every stage of training, RSM is initialized with 'reliable' samples generated by both fixed and trainable models. A trainable model is supervised using samples (source-oriented pseudo-labels and target-oriented pseudo-labels) stored in RSM.

Systems Description
Our proposed methods have three operations in common: (1) generating pseudo-labels, (2) selecting pairs of reliable samples and pseudo-labels based on the model's self-confidence, and (3) doing supervised learning using the pseudo-labels. We concentrate on sorting out 'reliable' pseudo-labels since training with incorrect labels harms the performance of the model. Self-entropy is usually treated as an indicator of self-confidence (Zou et al., 2018; Saporta et al., 2020). We adopt normalized self-entropy as the evaluation metric for pseudo-labels. In this metric, $x_t$ denotes each token that makes up a sentence $X \in \mathcal{X}$, $l(x_t)$ denotes the probability that the classifier assigns to the predicted label, and $N_c$ refers to the total number of labels. Specifically, we propose two adaptation methods to efficiently fit the model trained on a source domain to a target domain: Self-Adapter and SCT.
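As an illustration of the normalized self-entropy metric, the following minimal sketch computes it for a single token. The exact formula is not reproduced in this text, so the sketch assumes the common form: the Shannon entropy of the token's full label distribution divided by $\log N_c$, which maps the score into [0, 1]. The function name `normalized_self_entropy` is hypothetical.

```python
import math
from typing import Sequence


def normalized_self_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of one token's label distribution, divided by log(N_c).

    Assumption: the metric is taken over the full predicted distribution.
    """
    n_labels = len(probs)
    if n_labels < 2:
        raise ValueError("need at least two labels")
    # Zero-probability labels contribute nothing (the limit of p*log(p) is 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(n_labels)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain prediction
peaked = [0.97, 0.01, 0.01, 0.01]    # confident prediction
print(normalized_self_entropy(uniform))
print(normalized_self_entropy(peaked))
```

Under this assumed form, a uniform prediction scores 1 and a one-hot prediction scores 0, so a lower value means a more confident pseudo-label.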

Method 1: Self-Adapter
We propose Self-Adapter, a self-learning method supervised by a reliable sample memory (RSM). RSM is a set of data with pseudo-labels that consists of two parts, source-oriented pseudo-labels and target-oriented pseudo-labels, which represent, respectively, the knowledge learned from the source domain and the new features to learn from the target domain. We further apply a trick called 'Sloughing', which helps prevent over-fitting. The overall workflow of Self-Adapter is shown in Figure 1.

Reliable Sample Memory
RSM consists of pairs of input sentences and the corresponding pseudo-labels obtained from a Siamese-like network structure. Two RoBERTa-based (Liu et al., 2019) classifiers are initialized with a RoBERTa-based sequence tagging model fine-tuned only on source train data, which is given as a baseline model in the task. One branch keeps its weight parameters fixed while the other is fine-tuned during training.
Both branches of the network take a target domain sentence $X$ as input and output, for each token, a set of probabilities over the labels it could be assigned. We use self-entropy as the measure of self-confidence for each token. If the self-entropy of every token is smaller than the predefined threshold, the input sentence paired with the pseudo-labels generated by the model is kept as part of RSM.
The fixed part of the network consistently outputs the same pairs $(X_s, \hat{Y}_s)$, which are called source-oriented pseudo-labels. The trainable part of the network outputs different pairs $(X_t, \hat{Y}_t)$, called target-oriented pseudo-labels, after each update, and both are stored in RSM. All sentence-label pairs in RSM, both source-oriented and target-oriented pseudo-labels, are used to train the trainable part of the network in a supervised manner. We call the cycle in which RSM is updated a stage, and each stage is composed of several epochs.
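The construction of RSM described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the `Predictor` interface, the helper names, and the toy stand-in models are assumptions, and the entropy is assumed to be the normalized Shannon entropy of each token's label distribution.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Sentence = Sequence[str]
# Hypothetical model interface: one label distribution per token.
Predictor = Callable[[Sentence], List[Sequence[float]]]
Sample = Tuple[List[str], List[int]]


def token_entropy(probs: Sequence[float]) -> float:
    """Normalized Shannon entropy of one token's label distribution (assumed form)."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def pseudo_label(sentence: Sentence, predict: Predictor, theta: float) -> Optional[List[int]]:
    """Return argmax labels if every token is below the threshold, else None."""
    dists = predict(sentence)
    if any(token_entropy(d) >= theta for d in dists):
        return None
    return [max(range(len(d)), key=lambda k: d[k]) for d in dists]


def build_rsm(sentences: Sequence[Sentence], fixed: Predictor,
              trainable: Predictor, theta: float = 0.1) -> List[Sample]:
    """Collect source-oriented (fixed) and target-oriented (trainable) pairs."""
    rsm: List[Sample] = []
    for predict in (fixed, trainable):
        for sentence in sentences:
            labels = pseudo_label(sentence, predict, theta)
            if labels is not None:
                rsm.append((list(sentence), labels))
    return rsm


# Toy stand-ins for the two branches (3 labels; not real RoBERTa models).
SURE, UNSURE = [0.998, 0.001, 0.001], [0.5, 0.3, 0.2]


def fixed_model(sentence: Sentence) -> List[Sequence[float]]:
    return [SURE] * len(sentence)


def trainable_model(sentence: Sentence) -> List[Sequence[float]]:
    return [SURE if len(sentence) < 3 else UNSURE] * len(sentence)


rsm = build_rsm([["next", "week"], ["on", "May", "3"]], fixed_model, trainable_model)
print(len(rsm))
```

A single uncertain token rejects the whole sentence, which matches the sentence-level filtering rule stated above; in each stage the memory would be rebuilt and the trainable branch trained on it.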

Sloughing trick
After RSM has been updated sufficiently, we generate pseudo-labels with the model trained on RSM and apply self-entropy filtering once more to obtain new reliable samples. Subsequently, we re-initialize the trainable part of the network with the parameters of the baseline and train it under the supervision of the new reliable samples. We call this procedure Sloughing. Since many of the reliable samples overlap between RSM updates, over-fitting tends to happen over time.
Sloughing then efficiently prevents over-fitting by newly initializing a model that is not yet fitted to the test data.
[Figure: SCT training pipeline] RTM is updated at the start of each training step. In the multi-branch network, whose two branches $C_{s2t}$ and $C_t$ share a fixed RoBERTa-based feature extractor, the loss of the $C_{s2t}$ branch is computed under the supervision of source-oriented token-wise pseudo-labels, and the loss of the $C_t$ branch is computed under the supervision of target-oriented token-wise pseudo-labels.
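The control flow of the Sloughing procedure described in this subsection can be sketched as below. This is a hedged illustration only: the callbacks `label_reliable` and `train`, their signatures, and the toy dictionary "models" are hypothetical, and the sketch assumes that each round relabels with the most recently trained model before restarting from the baseline weights.

```python
import copy
from typing import Any, Callable, List, Sequence, Tuple

Sample = Tuple[List[str], List[int]]
# Hypothetical callbacks; names and signatures are not from the paper.
Labeler = Callable[[Any, Sequence[Sequence[str]], float], List[Sample]]
Trainer = Callable[[Any, List[Sample]], None]


def slough(baseline: Any, adapted: Any, sentences: Sequence[Sequence[str]],
           label_reliable: Labeler, train: Trainer,
           rounds: int = 3, theta: float = 0.01) -> Any:
    """Relabel with the current model, then retrain a fresh copy of the baseline."""
    current = adapted
    for _ in range(rounds):
        reliable = label_reliable(current, sentences, theta)
        fresh = copy.deepcopy(baseline)  # discard the weights fitted so far
        train(fresh, reliable)           # in-place update on the new samples
        current = fresh
    return current


# Toy stand-ins: a "model" is a dict counting the samples it was trained on.
def toy_labeler(model: Any, sentences: Sequence[Sequence[str]], theta: float) -> List[Sample]:
    return [(list(s), [0] * len(s)) for s in sentences]


def toy_trainer(model: Any, samples: List[Sample]) -> None:
    model["seen"] += len(samples)


baseline = {"seen": 0}
adapted = {"seen": 12}
result = slough(baseline, adapted, [["next", "week"], ["in", "May"]],
                toy_labeler, toy_trainer)
print(result, baseline)
```

The point of the toy run is that the returned model carries only the training from its final round, while the baseline itself is never modified.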

Method 2: Selective Classifier Training
Selective Classifier Training (SCT) is a training method for a network that consists of a RoBERTa-based feature extractor and multi-branch classifiers. The feature extractor and classifiers are initialized with the RoBERTa-based sequence tagging model fine-tuned only on source train data, given as the baseline model in the development phase. In SCT, only the classifiers are updated, under the supervision of Reliable Token Memory (RTM).

Reliable Token Memory
RTM consists of pairs of tokens and their pseudo-labels obtained from a network with two separate branches. Both branches share the fixed feature extractor, which is the same as the feature extractor of the SCT training pipeline. Two classifiers, a trainable classifier $C_t$ and a fixed classifier $C_s$, make predictions on the contextual embeddings passed from the feature extractor.
To update RTM, we first obtain contextual embeddings for all tokens in the target domain sentences by feeding all sentences to the shared feature extractor, and then obtain a prediction for each token embedding. Token embeddings $f_i$ whose normalized self-entropy, as predicted by $C_t$, is lower than the threshold θ are called reliable token-wise samples. The pseudo-labels of reliable token-wise samples predicted by $C_s$ are called source-oriented token-wise pseudo-labels. The pseudo-labels of reliable token-wise samples predicted by $C_t$ are called target-oriented token-wise pseudo-labels. The pairs of reliable token-wise samples and their source-oriented token-wise pseudo-labels, together with the pairs of reliable token-wise samples and their target-oriented token-wise pseudo-labels, constitute RTM.
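The RTM update described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Classifier` interface, the function names, and the toy classifiers are hypothetical, and the normalized self-entropy is assumed to be the Shannon entropy of the token's label distribution divided by the log of the number of labels.

```python
import math
from typing import Callable, List, Sequence, Tuple

Embedding = Tuple[float, ...]
# Hypothetical interface: a classifier maps an embedding to a label distribution.
Classifier = Callable[[Embedding], Sequence[float]]
TokenSample = Tuple[Embedding, int]


def normalized_entropy(probs: Sequence[float]) -> float:
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def argmax(probs: Sequence[float]) -> int:
    return max(range(len(probs)), key=lambda k: probs[k])


def build_rtm(embeddings: Sequence[Embedding], c_s: Classifier, c_t: Classifier,
              theta: float = 0.1) -> Tuple[List[TokenSample], List[TokenSample]]:
    """Keep tokens on which C_t is confident; label them with both C_s and C_t."""
    source_pairs: List[TokenSample] = []
    target_pairs: List[TokenSample] = []
    for f_i in embeddings:
        p_t = c_t(f_i)
        if normalized_entropy(p_t) >= theta:
            continue  # C_t is not confident on this token
        source_pairs.append((f_i, argmax(c_s(f_i))))
        target_pairs.append((f_i, argmax(p_t)))
    return source_pairs, target_pairs


# Toy classifiers over 1-dimensional "embeddings" (illustrative only).
def toy_c_s(f: Embedding) -> Sequence[float]:
    return [0.998, 0.001, 0.001]


def toy_c_t(f: Embedding) -> Sequence[float]:
    if f[0] > 0.8:
        return [0.001, 0.998, 0.001]
    if f[0] < 0.2:
        return [0.998, 0.001, 0.001]
    return [0.4, 0.3, 0.3]


source_pairs, target_pairs = build_rtm([(0.1,), (0.9,), (0.5,)], toy_c_s, toy_c_t)
print(source_pairs)
print(target_pairs)
```

Reliability is decided by $C_t$ alone, so both lists always cover the same tokens and differ only in which classifier supplied the label.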

Multi-branch network
With RTM, we train a multi-branch network in which the branches share a fixed feature extractor. The branches are two classifiers, $C_{s2t}$ and $C_t$. The loss of the $C_{s2t}$ branch is computed under the supervision of source-oriented token-wise pseudo-labels, and the loss of the $C_t$ branch is computed under the supervision of target-oriented token-wise pseudo-labels. RTM is updated at the start of each training step.
The loss function combines the losses of the two branches, where $L_{s2t}$ and $L_t$ denote the loss functions of the $C_{s2t}$ branch and the $C_t$ branch, respectively, and α is a weight between the two branches. We gradually increase α from 0 to 1 to deal with high instability in the early stages of learning, in the same way as Kim et al. (2020). In the test phase, we use the classification probability of the $C_t$ branch.
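To make the role of α concrete, the sketch below combines the two branch losses. The formula itself is not reproduced in this text, so two assumptions are made and should be read as such: the total loss is taken to be the convex combination $(1 - \alpha) L_{s2t} + \alpha L_t$, and α follows a simple linear ramp. Neither the form nor the schedule is confirmed by the source, and the function names are hypothetical.

```python
def alpha_at(step: int, total_steps: int) -> float:
    """Linear ramp from 0 to 1 (an assumed schedule, for illustration only)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


def combined_loss(loss_s2t: float, loss_t: float, alpha: float) -> float:
    """Assumed form of the total loss: (1 - alpha) * L_s2t + alpha * L_t."""
    return (1.0 - alpha) * loss_s2t + alpha * loss_t


for step in (0, 50, 100):
    alpha = alpha_at(step, 100)
    print(step, alpha, combined_loss(2.0, 1.0, alpha))
```

Under these assumptions, training starts fully supervised by the source-oriented pseudo-labels and ends fully supervised by the target-oriented ones, which is one way to realize the gradual increase of α.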

Experiments
We evaluate our two models, Self-Adapter and SCT, and their variations. The baseline on the development data, called Source-Trained, is a RoBERTa-based sequence tagging model pre-trained only on the source data: de-identified clinical notes from the Mayo Clinic. On the test data there is another baseline, Dev-Tuned, which is the source pre-trained model (i.e., Source-Trained) fine-tuned on the labeled development data. The development data is the annotated news portion of the SemEval-2018 Task 6 data. The test data is a set of annotated documents extracted from food security warning systems. The development data consists of 1580 sentences and the test data consists of 3911 sentences. The total number of labels is 65, where label 0 indicates a non-time entity and labels 1-64 indicate different types of time entities.
Table 1: F1, Precision, Recall on the development data. The model submitted to the competition is marked with *. SA indicates Self-Adapter and SA+Sloughing is a system where Sloughing is applied on a model trained with Self-Adapter. SA-filtering is a system whose training pipeline is the same as Self-Adapter except that confidence filtering is not done. Source-Trained is a RoBERTa-based sequence tagging model pre-trained only on the source data: de-identified clinical notes from the Mayo Clinic, given as the baseline model in the development phase of the competition.

Experimental setup
For all of our models, we set the normalized self-entropy threshold θ = 0.1, except when applying the Sloughing trick, for which θ = 0.01. We train Self-Adapter for 3 stages. Each stage consists of 4 epochs with batch size 1 (sentence-level), and the learning rate is fixed at 5e-5. In Self-Adapter, pseudo-labels are updated at every stage. In Self-Adapter combined with Sloughing, we apply Sloughing 3 times, each time training for 4 epochs with batch size 1 (sentence-level) and a learning rate fixed at 5e-5.
In SCT, pseudo-labels are updated every epoch. We train for 2 epochs with batch size 4 (token-level), and the learning rate follows the same inverse decay scheduler as Kim et al. (2020), with an initial learning rate of 5e-5. We use the Adam optimizer in all models.

Experimental results and analysis
Table 2: F1, Precision, Recall on the test data. The model submitted to the competition is marked with *. SA indicates Self-Adapter and SA+Sloughing is a system where Sloughing is applied on a model trained with Self-Adapter. Source-Trained is a RoBERTa-based sequence tagging model pre-trained only on the source data: de-identified clinical notes from the Mayo Clinic, and Dev-Tuned is the source pre-trained model (i.e., Source-Trained) fine-tuned on the labeled development data.
Table 1 and Table 2 show the performance of the proposed methods on the development and test data, respectively. Each method is evaluated with Precision, Recall, and F1. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Recall is the ratio of correctly predicted positive observations to all actual positive observations. F1 is the harmonic mean of Precision and Recall. Our primary metric is F1, a widely used performance indicator in text classification tasks. On both datasets, Self-Adapter combined with Sloughing performs best in F1 and Self-Adapter performs second-best. Compared to Self-Adapter, SCT does not provide a significant improvement in F1. Self-Adapter without confidence filtering performs almost the same as Source-Trained on every evaluation metric.
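The three evaluation metrics used here can be computed from counts of true positives, false positives, and false negatives, as in the following minimal sketch. The function name and the example counts are hypothetical and are not results from this paper; F1 is computed as the harmonic mean of Precision and Recall.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 (harmonic mean) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, purely for illustration.
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=6)
print(p, r, f1)
```

Because the harmonic mean is dominated by the smaller of the two values, a system cannot reach a high F1 by maximizing only Precision or only Recall.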

Impact of confidence filtering
Our confidence filtering proves to be effective in dealing with the uncertainty of pseudo-labels. Self-Adapter, whose core is confidence filtering, improves F1 over Source-Trained by 3.7 percentage points on the development data and by 1.6 percentage points on the test data. The system whose training pipeline is the same as Self-Adapter, except that confidence filtering is not done, performs almost the same as Source-Trained.

Necessity of training feature extractor
Well-trained BERT embeddings contain both syntactic (Hewitt and Manning, 2019) and semantic (Coenen et al., 2019) information about words. However, this holds only when the model is fine-tuned with data from the same domain as the target domain. It is well known that embedding models trained on a different domain capture domain-specific vocabulary and word semantics poorly because of domain shift (Sarma et al., 2018). Since RoBERTa is a BERT-based language model, the same issue arises for the RoBERTa used in this task. Thus, if the feature extractor used for embedding words is fixed during training, the resulting embeddings do not provide sufficient information to the classifier, which limits the achievable improvement in performance. This is also shown by the experimental results, in which Self-Adapter outperforms SCT.

Inefficiency of Sloughing
In Self-Adapter, the model learns from almost all sentences in the development data and test data. Only 333 sentences out of 1580 in the development data and 917 sentences out of 3911 in the test data were filtered, despite the strict threshold we set (θ = 0.1). This limits the effect of Sloughing in our method. Sloughing improves performance on both the development and test data, but not enough to be taken as meaningful: F1 increases by 0.04 percentage points on the development data and by 0.1 percentage points on the test data.
The somewhat discouraging effect of Sloughing is due to the setting of our task, in which training is done with almost all samples in the test data despite confidence filtering. We expect Sloughing to be more effective in a setting where a larger proportion of samples is filtered and the ability to generalize to unseen data is therefore more important. We leave the verification of this hypothesis to follow-up work.
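The filtering counts reported in this subsection can be turned into proportions with the short calculation below. It assumes that 'filtered' means removed by the self-entropy threshold, so the remaining sentences are the ones the model trains on; the function name `filtered_share` is hypothetical.

```python
def filtered_share(filtered: int, total: int) -> float:
    """Fraction of sentences removed by the confidence filter (assumed meaning)."""
    return filtered / total


# Counts taken from the text: (split name, filtered sentences, total sentences).
for name, filtered, total in (("development", 333, 1580), ("test", 917, 3911)):
    share = filtered_share(filtered, total)
    print(f"{name}: {share:.1%} filtered, {total - filtered} sentences kept")
```

Under that reading, roughly a fifth to a quarter of the sentences are removed in each split, and the model still trains on the large majority of the data.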

Conclusion
In this paper, we propose two novel training methods, Self-Adapter and Selective Classifier Training, that improve model performance on the target domain by leveraging only the RoBERTa-based model pre-trained on source data. Both methods rely on self-learning with highly credible pseudo-labels that are filtered based on self-entropy, and differ only in which parts of the network are trainable. We also propose the Sloughing trick to prevent over-confidence of the model by softening the network output. Our work is highly applicable in the real world since we achieve a remarkable improvement in performance using only a small amount of unannotated test data, without any manual supervision.