The University of Arizona at SemEval-2021 Task 10: Applying Self-training, Active Learning and Data Augmentation to Source-free Domain Adaptation

This paper describes our systems for negation detection and time expression recognition in SemEval 2021 Task 10, Source-Free Domain Adaptation for Semantic Processing. We show that self-training, active learning, and data augmentation techniques can improve a model's generalization to unlabeled target-domain data without access to source-domain data. We also perform detailed ablation studies and error analyses for our time expression recognition systems to identify the source of the performance improvement and to give constructive feedback on the temporal normalization annotation guidelines.


Introduction
Unsupervised Domain Adaptation (UDA) is the task of generalizing knowledge acquired by a model trained on labeled data in one domain (the source domain) to unlabeled data in a different domain (the target domain). Conventional UDA algorithms usually require access to both source-domain and target-domain data (Ganin et al., 2016; Glorot et al., 2011; Chen et al., 2012; Louizos et al., 2016). However, sharing source-domain data is often not practical for clinical texts due to their highly sensitive personal information and complex data use agreement procedures (Laparra et al., 2020). To overcome this difficulty, Laparra et al. (2020) propose a new task, source-free domain adaptation (SFDA), in which only models trained on source-domain data are shared, making it possible to use information from the source domain while reducing private information leakage. The biggest challenge of this task is to transfer task-related information embedded in the trained models.
Our team participated in both subtasks of SemEval 2021 Task 10 (Laparra et al., 2021), Source-Free Domain Adaptation for Semantic Processing: negation detection and time expression recognition.
For both tasks, participants were given a RoBERTa model (Liu et al., 2019) fine-tuned on the source domain and asked to make predictions in the target domain.
The goal of the negation detection task is to predict whether an event in a sentence is negated by its context. This is a binary sentence classification task. For example, given the event diarrhea and the sentence Has no diarrhea and no new lumps or masses, the goal is to predict that diarrhea is negated by its context. The goal of the time expression recognition subtask (Laparra et al., 2018) is to recognize time expressions in the target domain. This is a named entity recognition (NER) task with 65 entity types in inside-outside-beginning (IOB) format. Entity types in this task are the formally defined time entity types from the Semantically Compositional Annotation of Time Expressions (SCATE) (Bethard and Parker, 2016) annotation schema. For example, in 2021-02-19, 2021 will be labeled as Year, 02 as Month-Of-Year, and 19 as Day-Of-Month.
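For illustration only, the example date above maps to token-level IOB labels roughly as follows; this is a sketch that splits on hyphens for readability, and the "O" label assumed for the hyphens is our assumption (the actual system operates on BPE tokens produced by the RoBERTa tokenizer).

```python
# Illustrative IOB labeling of "2021-02-19" under the SCATE types named above.
# Token boundaries and the "O" label on the hyphens are simplifying assumptions;
# the real model sees BPE tokens produced by the RoBERTa tokenizer.
example = [
    ("2021", "B-Year"),
    ("-",    "O"),
    ("02",   "B-Month-Of-Year"),
    ("-",    "O"),
    ("19",   "B-Day-Of-Month"),
]
```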
We investigate self-training, active learning, and data augmentation techniques for negation detection and time expression recognition under the SFDA setting. Our contributions are:
1. We demonstrate that simple self-training over a small portion of the target domain data can effectively improve the performance of the negation detection model.
2. We demonstrate that active learning with data augmentation can significantly improve time expression recognition performance when the selected examples are accurately annotated.
3. We perform ablation studies for the time expression recognition systems to analyze where the performance improvement comes from.
4. We analyze our annotation errors for the time expression recognition task and give constructive feedback on the annotation guidelines and schema.

System Description
The source-domain models for both subtasks are RoBERTa-base models with linear classification output layers, implemented via the Huggingface Transformers library (Wolf et al., 2020), using RobertaForSequenceClassification for negation and RobertaForTokenClassification for time. The input to the models is a sequence tokenized by Byte-Pair Encoding (BPE). Following the conventions of the RoBERTa input format, two special tokens <s> and </s> are inserted at the beginning and end of the sequence, respectively. In the negation detection task, targeted events are marked with two special tokens <e> and </e> inserted before and after the event. For example, the sentence Has no diarrhea and no new lumps or masses with event diarrhea is converted to <s>Has no <e>diarrhea</e> and no new lumps or masses.</s>. The model output for negation detection is whether the target event is negated, and the model output for time expression recognition is a label for each input token.
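The following is a minimal sketch of this input format for the negation model, assuming the Huggingface Transformers library; treating <e> and </e> as extra tokens added to the tokenizer is our assumption about the setup, not the organizers' exact code.

```python
# Sketch of the negation-detection input format (the handling of the
# <e>/</e> markers is an assumption, not the organizers' released code).
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
tokenizer.add_tokens(["<e>", "</e>"])          # event markers as extra tokens

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.resize_token_embeddings(len(tokenizer))  # account for the added tokens

# The targeted event is wrapped in <e> ... </e>; <s> and </s> are added
# automatically by the tokenizer.
text = "Has no <e>diarrhea</e> and no new lumps or masses."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits                # shape (1, 2): negated vs. not
```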

Negation Detection System
We employ a simple self-training (Yarowsky, 1995) approach that fine-tunes the model with its own predictions on the unlabeled dataset. We start with the pre-trained source-domain model, M. Then, for each self-training iteration:
1. We initialize an empty training set, L.
2. We use M to label the target domain data.
3. If an instance is labeled with a probability above a threshold τ, we add it to L with the predicted label as its pseudo-label.
4. We fine-tune M on L.
Self-training stops when the model's predictions are identical in two consecutive iterations or when the number of iterations exceeds a predefined maximum. Note that the training set L is reinitialized at each iteration, and the model is iteratively fine-tuned; a sketch of the loop is shown below.
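A minimal sketch of this loop, where predict_proba and fine_tune are hypothetical caller-supplied wrappers around the usual Transformers inference and fine-tuning steps, and the default τ and iteration cap are placeholder values rather than our submitted configuration:

```python
import numpy as np

def self_train(model, unlabeled_texts, predict_proba, fine_tune, tau=0.9, max_iters=10):
    """Iterative self-training as described above.

    predict_proba(model, texts) -> np.ndarray of shape (N, num_labels)
    fine_tune(model, examples)  -> fine-tuned model
    Both are hypothetical wrappers; tau and max_iters are placeholders.
    """
    prev_preds = None
    for _ in range(max_iters):
        probs = predict_proba(model, unlabeled_texts)
        preds = probs.argmax(axis=-1)
        # Stop if predictions are unchanged between consecutive iterations.
        if prev_preds is not None and np.array_equal(preds, prev_preds):
            break
        prev_preds = preds
        # L is re-initialized every iteration: keep only confident predictions
        # (probability above tau) with the predicted label as pseudo-label.
        L = [(text, int(label))
             for text, label, conf in zip(unlabeled_texts, preds, probs.max(axis=-1))
             if conf >= tau]
        model = fine_tune(model, L)
    return model
```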

Time Expression Detection System
Our approach combines active learning (Cohn et al., 1996) and data augmentation (Simard et al., 2003). We start with the pre-trained source-domain model, M_0, a copy of the pre-trained source-domain model, M, and an empty training set, L. Then, for each iteration:
1. We select the k instances about which M is most uncertain, manually annotate them, and add them to L (details in section 2.2.1).
2. We augment each manually annotated instance with n new examples and add them to L (details in section 2.2.2).
3. We re-initialize M to M_0 and fine-tune it on L.
We repeat this process i times. Note that the training set L is built cumulatively, and M is re-initialized on each iteration; a sketch of the loop is shown below.
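A minimal sketch of the iteration structure, where select_uncertain, annotate, augment, and fine_tune are hypothetical stand-ins for the steps in sections 2.2.1 and 2.2.2, and the default k, n, and iteration counts are placeholders rather than our submitted configuration:

```python
import copy

def active_learning_loop(M0, unlabeled_sentences, select_uncertain, annotate,
                         augment, fine_tune, k=32, n=4, iterations=5):
    """Active learning with data augmentation, as described above.

    select_uncertain(M, sentences, k) -> k sentences M is most uncertain about
    annotate(sentences)               -> manually labeled examples (section 2.2.1)
    augment(example, n)               -> n automatically generated variants (section 2.2.2)
    fine_tune(model, examples)        -> fine-tuned model
    All helpers and default values are placeholders, not the submitted system.
    """
    L = []                        # cumulative training set
    M = copy.deepcopy(M0)
    for _ in range(iterations):
        batch = select_uncertain(M, unlabeled_sentences, k)
        labeled = annotate(batch)
        L.extend(labeled)
        for example in labeled:
            L.extend(augment(example, n))
        # Re-initialize M to the source-domain model and fine-tune on L.
        M = fine_tune(copy.deepcopy(M0), L)
    return M
```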

Active Learning
We use active learning to manually label the target-domain examples about which the model is most uncertain. Manually labeling the entire target-domain dataset during the test phase is not practical: it requires substantial expertise and time from annotators (we show later that it is very difficult to absorb the annotation guidelines in a short time), and low-quality annotations hurt the performance of the model. In each iteration, we select the top k target-domain sentences with the highest uncertainty scores for manual annotation. We define the uncertainty score of a sentence as the sum, over its tokens, of the entropy of the model's predicted label distribution. Manual annotation follows the SCATE annotation guidelines released by the organizers.
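A minimal sketch of this uncertainty score, assuming a RobertaForTokenClassification model and a single encoded sentence (a hedged illustration, not the submitted implementation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def uncertainty_score(model, inputs):
    """Sum of per-token entropies of the model's predicted label distribution."""
    logits = model(**inputs).logits                    # (1, seq_len, num_labels)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)   # (1, seq_len)
    return token_entropy.sum().item()
```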
The annotators were the first two authors of this paper, a Linguistics PhD student and an Information PhD student. During the annotation process, we first annotated examples individually and then resolved annotation differences through discussion. Our first exposure to the SCATE annotation schema was approximately 10 days before the start of the test phase, when we began reading the guidelines and posting questions on the Google group. We used gold annotations from the development set (in the news domain) to simulate the annotation process during the practice phase. We believe this is similar to most real-world SFDA situations, where the person applying the model to the target domain is unfamiliar with the annotation guidelines and has limited time to learn them.

Data Augmentation
Inspired by Miao et al. (2020), we applied data augmentation to increase the size of our training set beyond what can be achieved by manual annotation, and to improve the generalization of the model. For each time entity that we manually annotated, we automatically generated new training examples by

Data
All data used is in English. Both subtasks had training, development, and test data, each representing a different domain. As participants, we did not have access to the training sets, which were used by the organizers to fine-tune the pre-trained RoBERTa-base models and obtain the source-domain models. We used the source-domain models and development sets to develop our source-free domain adaptation systems during the practice phase, and tested our systems during the test phase. We summarize the data in table 1.

Experiments
The organizers provided two baseline models for each task: the source-domain model, and the source-domain model fine-tuned on the development set. The official evaluation metric is the F1 score. Precision and recall scores are also reported.
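For reference, the standard definitions of precision (P), recall (R), and F1 in terms of true positives (TP), false positives (FP), and false negatives (FN) are:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```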

Negation Detection
In the testing phase, we first fine-tuned the source-domain model on the labeled development set. Although the development set and the test set come from different domains, both are clinically relevant, so we believed that fine-tuning the model on the development set could improve its performance on the test set. Because of time and hardware constraints, we randomly sampled 3,000 instances from the 622,703 test set instances as unlabeled data for self-training. We used the same hyperparameters for fine-tuning the source-domain model on the development set and for self-training the fine-tuned model on the randomly sampled test data. All hyperparameters are shown in table 4 in appendix A.2. Our submission ranked 2nd. Table 2 shows that our system outperformed both baseline models provided by the organizers.

Time Expression Recognition
We did not fine-tune the source-domain model on the development set during the test phase: the development set is from the newswire domain, while the test set is from the food security domain, and we thought there might be a large difference between these two domains, so fine-tuning on a different domain might hurt the performance of the model on the test set. Following the code provided by the organizers, we used the sentencizer from spaCy (Honnibal et al., 2020) to split the input documents into sentences and used them as inputs to the model. All hyperparameters are shown in appendix A.2. Our submitted system ranked 6th. Table 3 shows that our submitted system's performance (row 3) is no better than the best baseline model (row 2) provided by the organizers.
To investigate the reasons for the lower-than-expected test performance, we used the gold annotations in the test set for our post-evaluation runs (rows 5-11 in table 3). Note that performance for these rows will be artificially inflated, since up to 160 of the 926 test sentences were included in the system's training data. Nonetheless, we see that using the gold annotations instead of our manual annotations (row 5 vs. row 3 in table 3) improves the performance of our system by .160 F1. This suggests that our system can improve its generalization ability if we can accurately label the target domain data.
We further analyze where the performance improvement comes from in section 5 and provide a detailed analysis of our annotation errors and give feedback on annotation guidelines in section 6.

Time Expressions Ablation Study
Effect of Fine-Tuning on Dev Data
From the baseline models' performances (row 1 vs. row 2 in table 3), we can see that the test performance of the model fine-tuned on the development set is slightly better than that of the pure source-domain model (+.010 F1). To verify whether this also holds for our active learning system, we add the fine-tuning strategy to our system (row 6 in table 3) and run the system on the labeled portion of the test set. The results (row 5 vs. row 6 in table 3) indicate that fine-tuning on the additional domain continues to help a bit (+.004 F1), even when followed by active learning.

Effect of Data Augmentation
We also investigate the contribution of our data augmentation strategy by removing it from our system and running on the labeled test set. The results show that data augmentation brings a +.065 F1 improvement to our system (row 7 vs. row 5 in table 3), indicating that data augmentation was a major source of performance improvements.

Effect of Size of Annotation Data
In real-world use cases, we often want to keep the amount of annotated data as small as possible, since annotation is time consuming and error-prone. To understand how our system performs with fewer manually annotated examples, we reduce the number of sentences to be annotated at each active learning iteration to 16, 8, and 4, resulting in rows 8, 9, and 10 in table 3. The results show that with only 20 correctly annotated sentences plus data augmentation (row 10), our system outperforms the best baseline model (row 2) by .073 F1. If we remove data augmentation from this model (row 11), its performance declines, but it still outperforms the best baseline model by .047 F1.

Time Expressions Annotation Analysis
Though gold annotations led to large performance improvements, the annotation for this task is challenging for untrained annotators. By reading the SCATE annotation guidelines and posting questions on the shared task Google group, our team annotated 160 sentences, 48 of which were in the labeled portion of the test set. We annotated 13,008 tokens in total (including padding tokens), and our overall accuracy on the 48 gold-annotated sentences is 0.991 over all categories and 0.785 excluding the category O. We report detailed performance for each entity type in table 6 in appendix A.3.
We found several annotation patterns where our team consistently disagreed with the gold annotations. Our errors can be broadly attributed to two reasons: misinterpretation/underspecification of annotation guidelines, and ambiguity of the phrases.
Errors from misinterpretation/underspecification of the annotation schema: We annotated the token seasonal(ly) (e.g., seasonal progress, seasonal rainfall) as Calendar-Interval instead of Season-Of-Year because we thought Season-Of-Year applies only to seasons that are explicitly specified (such as summer); we considered seasonal similar to weekly, both referring to an unspecified interval. However, Season-Of-Year can be applied to very broad categories such as dry seasons and rainfall seasons, including seasons that are not explicitly specified. Also, seasonal, unlike weekly/monthly/yearly, refers to only one season of a year rather than every season of a year. Due to the ubiquity of this token in the dataset, this error affects our overall performance: correcting the annotation of this particular token leads to a +.042 F1 improvement (row 4 vs. row 3 in table 3). Another erroneous pattern is that we double-annotated phrases such as from . . . to . . . and between . . . to . . . ; specifically, we annotated both adpositions instead of choosing only the first. Finally, we also annotated more modifiers than the gold annotations. For instance, we annotated marketing in marketing year and long in long dry Jiaal Season as 'Modifier' instead of the category 'O'. It turns out that Modifier is a closed category in the gold annotations, and only a specific set of tokens are considered modifiers.
Errors from ambiguity within phrases: Some phrases allow multiple interpretations that lead to different annotations. For instance, confusion between 'Period' and 'Calendar-Interval' occurred frequently (e.g., we annotated weeks in recent weeks as 'Period' rather than 'Calendar-Interval'). Although "/" between seasons is commonly annotated as 'I-Season-Of-Year' in the gold annotations, we found that it can play different roles in specific contexts: if "/" is used between two terms that refer to the same season, it should be annotated as 'I-Season-Of-Year'; if it is used between two non-adjacent seasons, it should be annotated as 'Union'; and if it is used between two adjacent seasons, it could be annotated as 'Between' or 'Union'. Thus, the correct annotation requires a surprising amount of external knowledge about Ethiopian season terms. In fact, some cases remained uncertain: for example, Xaran refers to seasonal rains from April through September and Xagaa refers to the second dry season (July to September), and when the two tokens are joined by "/" it is difficult to interpret the meaning of "/". We also found that the conjunction and causes ambiguity. For example, and in rains in May and August could be considered an operator over months (i.e., rains in Union(May, August)) or an operator over rains (i.e., Union(rains in May, rains in August)). The former interpretation requires annotating and, whereas the latter does not, despite the fact that the two interpretations are essentially semantically equivalent. Lastly, we found the article the difficult to annotate. For instance, the in the month may be annotated as this or last depending on the context, and sometimes the context is not clear enough to tell the difference.
Our annotation analysis leads to several suggestions for the annotation schema and its documentation. Our errors in the first category indicate some potentially helpful updates, such as including more examples in certain categories (e.g., 'Season-Of-Year'), explicitly documenting whether a given category is closed or open, and specifying how to deal with multi-word phrases or even circumpositions. The second category of errors, however, might require refinement of the annotation schema itself. For example, 'Between' and 'Union' could perhaps be unified, and 'Period' could be merged into 'Calendar-Interval' or confined to an explicit set of circumstances.

Conclusion
Our overall rank (by F1 score) was 2nd for the negation detection task and 6th for time expression recognition. Our results suggest that simple self-training can be used in sentence-level SFDA tasks to improve a trained model's performance on a new domain. For token-level tasks, our analysis shows that both active learning and data augmentation can bring significant performance improvements, provided that the examples selected by active learning are accurately annotated. Our analysis and feedback could also be used to improve the SCATE annotation guidelines/schema in future work.