PTST-UoM at SemEval-2021 Task 10: Parsimonious Transfer for Sequence Tagging

This paper describes PTST, a source-free unsupervised domain adaptation technique for sequence tagging, and its application to the SemEval-2021 Task 10 on time expression recognition. PTST is an extension of the cross-lingual parsimonious parser transfer framework, which uses high-probability predictions of the source model as a supervision signal in self-training. We extend the framework to a sequence prediction setting, and demonstrate its applicability to unsupervised domain adaptation. PTST achieves an F1 score of 79.6% on the official test set, with a precision of 90.1%, the highest out of 14 submissions.


Introduction
SemEval-2021 Task 10 presents source-free unsupervised domain adaptation (SFUDA) for semantic processing. The goal of unsupervised domain adaptation is to transfer a model from a source domain to another, different domain, called the target domain, using only unlabelled data in the target domain. Source-free unsupervised domain adaptation additionally assumes no access to source domain data: only the source model pre-trained on the source domain data is available. This situation may occur when the source domain data contains protected information that cannot be shared, or can be shared only after signing a complex data use agreement. While there are numerous works on SFUDA outside NLP (Hou and Zheng, 2020; Kim et al., 2020; Yang et al., 2020), SFUDA research for NLP is severely lacking in spite of its importance in, for example, clinical NLP (Laparra et al., 2020). There are two tasks involved in SemEval-2021 Task 10: negation detection and time expression recognition. We participate only in the latter.
Our approach is an extension of the parsimonious parser transfer framework (PPT; Kurniawan et al., 2021). PPT allows cross-lingual transfer of dependency parsers in a source-free manner, requiring only unlabelled data on the target side. It leverages the output distribution of the source model to build a chart containing high probability trees for each sentence in the target data. We extend this work by (1) formulating PPT for chain structures and evaluating it on a semantic sequence tagging task; and (2) demonstrating its effectiveness in a domain adaptation setting. We call our method Parsimonious Transfer for Sequence Tagging (PTST).
We find PTST effective for improving the precision of the system in the target domain. It ranks 7th out of 14 submissions in the official leaderboard in terms of F1 score, but 1st in precision, with a gap of 3 points from the second best. Drawing on the model calibration literature, we provide a way to combat the problem of model overconfidence, which is key to making PTST outperform a simple transfer of a source model to the target domain. However, we also find that PTST struggles to improve recall. In conclusion, our results suggest that PTST can be used for SFUDA, but further work is required to improve the precision-recall trade-off in the target domain.

Background
In the SemEval-2021 Task 10 time expression recognition task, the input is a single sentence, and the output is a sequence of tags indicating the time entity type of each word (if any). There are 32 time entities in total (e.g., Year, Hour-of-Day), and the tags are coded in BIO format. Fig. 1 shows an example input sentence and its output. The task organisers provided a pre-trained source model for the task. This source model is a RoBERTa-base model (Liu et al., 2019) that is pre-trained on more than 25K time expressions in English clinical notes from Mayo Clinic in SemEval-2018 Task 6. The model is distributed online via HuggingFace Models and can be obtained with the HuggingFace Transformers library. The organisers also released trial data for the practice phase containing 99 annotated English articles from the news domain. The official test data released by the organisers in the evaluation phase contains 47 articles that are in a different domain from the source and development data.
The time expression recognition task is formalised as a sequence tagging task. The literature on sequence tagging in NLP is massive (Jiang et al., 2020; He and Choi, 2020; Rahimi et al., 2019; He et al., 2019; Xie et al., 2018; Clark et al., 2018, inter alia). One closely related task is named-entity recognition (NER), whose goal is to detect mentions of named entities such as a Person or Organisation in an input sentence. Lample et al. (2016) introduced a now widely adopted neural architecture for this task, where input word embeddings are encoded with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) before they are passed through per-word softmax layers. More recently, it has become common to replace the LSTM with a Transformer (Vaswani et al., 2017). With the advancements of large pretrained language models, the standard is to use a model such as BERT (Devlin et al., 2019) as the encoder and fine-tune the model on labelled data. The source model of the time expression recognition task provided by the organisers was also trained in this manner.
In the context of unsupervised domain adaptation, a popular approach is domain adversarial training. Introduced by Ganin et al. (2016), it leverages multi-task learning in which the model is optimised not only on the main task objective but also a domain prediction objective. The learning signal for the latter is passed through a gradient reversal layer, ensuring that the learnt parameters are predictive for the main task, but general across domains. With pre-trained language models, Han and Eisenstein (2019) proposed to continue pre-training on the unlabelled target domain data, prior to finally finetuning on the labelled source domain data. Unfortunately, these approaches require access to the source domain data.
There is relatively little work on SFUDA in NLP; however, some works on source-free cross-lingual transfer exist. Wu et al. (2020) employ teacher-student learning for source-free cross-lingual NER. A teacher model trained on the source side predicts soft labels on the unlabelled target side data, and a student model is trained on those soft labels. Their method outperforms a simple direct transfer method where the source model is directly applied on the target side. More recently, a method for source-free cross-lingual transfer of dependency parsers was introduced by Kurniawan et al. (2021).
The key idea is to build a chart of high probability trees based on arc marginal probabilities for each unlabelled sentence on the target side, and treat all those trees as a weak supervision signal for training. Their method outperforms direct transfer as well as a variety of recent cross-lingual transfer methods that are not source-free. That said, the effectiveness of their method (a) on semantic (sequence labelling) tasks and (b) in a domain adaptation setting is unexplored, which is what we aim to address in this work.

System Description
We first describe our sequence tagging model (Section 3.1), before we present parsimonious transfer for sequence tagging (PTST) in Section 3.2.

Model
Our model is a linear-chain conditional random field (CRF) over tag sequences. It assigns a score $s(x, y)$ to a pair of input sentence $x$ and output tag sequence $y$, which can be expressed as
$$s(x, y) = \sum_{j=1}^{|x|} \pi(x, j, y_j) + \sum_{j=1}^{|x|-1} \phi(y_j, y_{j+1}) \tag{1}$$
where $\pi(x, j, t)$ is the emission score of word $x_j$ having tag $t$ and $\phi(t, t')$ is the transition score of having tag $t$ followed by tag $t'$. The probability of $y$ given $x$ is then
$$P(y \mid x) = \frac{\exp s(x, y)}{Z(x)} \tag{2}$$
The emission score function $\pi$ is parameterised by a neural network whose parameters are initialised with the source model. Specifically, the emission score function is the RoBERTa model provided by the task organisers. The transition score function $\phi$ is a $T \times T$ parameter matrix that is learned during training, where $T$ is the number of tags. Note that with dynamic programming, we can efficiently compute quantities such as the marginal probabilities of tag pairs $P((j, y_j, y_{j+1}) \mid x)$ or the partition function $Z(x) = \sum_{y \in \mathcal{Y}(x)} \exp s(x, y)$, where $\mathcal{Y}(x)$ is the set of all possible tag sequences for $x$.
Figure 2: Illustration of our method. Given an unlabelled sentence $x$ from the target domain, we build the set of high probability tag pairs $\tilde{A}(x)$ using the source model, which may contain correct tag pairs that do not occur in the predicted tag sequence (in orange). From these tag pairs, we build the chart $\tilde{Y}(x)$ containing tag sequences that can be assembled from tag pairs in $\tilde{A}(x)$. The single predicted tag sequence from the source model (bottom) is also included in the chart, but it may contain an incorrect tag (in red) as it is noisy.
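The CRF score and partition function described in this section can be illustrated with a minimal pure-Python sketch. It is not the Torch-Struct implementation used in our experiments; the function names and the list-of-lists layout (`emissions[j][t]` for the emission score, `transitions[t][t2]` for the transition score) are assumptions made for illustration only. The forward algorithm computes log Z(x) in O(nT²) time, and a brute-force enumeration is included purely as a sanity check for tiny inputs.

```python
import math
from itertools import product

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))); tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions, transitions, tags):
    """s(x, y): emission scores plus transition scores of adjacent tags."""
    score = sum(emissions[j][t] for j, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def log_partition(emissions, transitions):
    """log Z(x) computed with the forward algorithm."""
    num_tags = len(transitions)
    alpha = list(emissions[0])  # log-scores of all length-1 prefixes
    for j in range(1, len(emissions)):
        alpha = [
            emissions[j][t]
            + logsumexp([alpha[p] + transitions[p][t] for p in range(num_tags)])
            for t in range(num_tags)
        ]
    return logsumexp(alpha)


def log_prob(emissions, transitions, tags):
    """log P(y | x) = s(x, y) - log Z(x)."""
    return sequence_score(emissions, transitions, tags) - log_partition(
        emissions, transitions
    )


def brute_force_log_partition(emissions, transitions):
    """Exponential-time check: enumerate every tag sequence explicitly."""
    num_tags = len(transitions)
    all_sequences = product(range(num_tags), repeat=len(emissions))
    return logsumexp(
        [sequence_score(emissions, transitions, y) for y in all_sequences]
    )
```

Because `logsumexp` tolerates −∞, a transition score of −∞ simply removes every sequence that uses that transition from the sum.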

Unsupervised Adaptation
To perform unsupervised adaptation on the unlabelled target domain data, we extend PPT, our past work for source-free cross-lingual parsing (Kurniawan et al., 2021), to chain structures. Given a set of unlabelled sentences $D$ in the target domain, we build a chart of high probability tag sequences $\tilde{Y}(x)$ by leveraging the output distribution of the source model. The model then treats all sequences with sufficiently high predicted probability as possible tag sequences for $x$ for training. Concretely, it minimises the loss
$$L(\theta) = -\sum_{x \in D} \log \sum_{y \in \tilde{Y}(x)} P_\theta(y \mid x) \tag{3}$$
where $\theta$ denotes the target model parameters. The set $\tilde{Y}(x)$ is defined formally as
$$\tilde{Y}(x) = \{\, y \mid y \in \mathcal{Y}(x) \wedge A(y) \subseteq \tilde{A}(x) \,\} \tag{4}$$
where $\mathcal{Y}(x)$ is the set of all possible tag sequences for $x$, $A(y) = \{(j, y_j, y_{j+1}) \mid 1 \le j < |y|\}$ is the set of consecutive tag pairs in tag sequence $y$, and $\tilde{A}(x) = \bigcup_j \tilde{A}(x, j)$, where $\tilde{A}(x, j)$ is the set of high probability consecutive tag pairs $(j, t_j, t_{j+1})$ for words $x_j$ and $x_{j+1}$ (see Fig. 2 for illustration). Analogous to PPT, this set is constructed by adding tag pairs $(t_j, t_{j+1})$ in order of decreasing marginal probability until their cumulative probability exceeds a threshold $\sigma$. The predicted tag sequence from the source model is also included in $\tilde{Y}(x)$ so the chart is never empty.
Note that the method above is very similar to self-training, where the predictions from the source model are used as a supervision signal for training. In contrast to self-training, however, we build a chart of high probability predictions for each sample instead of just a single best prediction. We expect these predictions to be more useful than a single best prediction because it is more likely for the correct tag sequence to be in the chart than to be equal to the single best prediction. Even when this is not the case, we expect partially correct tag sequences to occur frequently enough in the chart that the model is still able to learn what the correct tag sequence is.
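The chart construction and loss described above can be sketched as follows. This is a hedged illustration, not our actual implementation: the function names, the dictionary representation of the pair marginals, and the explicit enumeration of the chart are all assumptions chosen for readability. The sketch also makes one simplifying assumption: the predicted sequence is included by adding its tag pairs to the allowed sets, which may admit a few extra sequences beyond the prediction itself. In practice the sum over the chart would be computed with dynamic programming rather than by enumeration.

```python
import math


def high_prob_pairs(marginals, sigma=0.95):
    """Pick tag pairs at one position, in order of decreasing marginal
    probability, until their cumulative probability exceeds sigma.

    `marginals` maps a pair (t_j, t_{j+1}) to its marginal probability.
    """
    ranked = sorted(marginals.items(), key=lambda item: item[1], reverse=True)
    selected, total = set(), 0.0
    for pair, prob in ranked:
        selected.add(pair)
        total += prob
        if total > sigma:
            break
    return selected


def build_allowed_pairs(pair_marginals, sigma=0.95, predicted=None):
    """Build the allowed pair set for every position j.

    `pair_marginals[j]` holds the marginals for words j and j+1. If the
    source model's predicted sequence is given, its pairs are added too
    (an assumption of this sketch), so the chart is never empty.
    """
    allowed = [high_prob_pairs(m, sigma) for m in pair_marginals]
    if predicted is not None:
        for j, pair in enumerate(zip(predicted, predicted[1:])):
            allowed[j].add(pair)
    return allowed


def chart_sequences(allowed):
    """Enumerate tag sequences whose consecutive pairs are all allowed.

    Assumes the sentence has at least two words. Only feasible for tiny
    examples; use dynamic programming for real data.
    """
    sequences = [list(pair) for pair in sorted(allowed[0])]
    for pairs in allowed[1:]:
        sequences = [
            seq + [b] for seq in sequences for (a, b) in sorted(pairs) if a == seq[-1]
        ]
    return [tuple(seq) for seq in sequences]


def chart_nll(chart, log_prob):
    """Loss for one sentence: -log of the probability mass on the chart.

    `log_prob` is any callable returning log P(y | x) for a tag sequence.
    """
    logs = [log_prob(y) for y in chart]
    m = max(logs)
    return -(m + math.log(sum(math.exp(v - m) for v in logs)))
```

With a sharp marginal distribution, the first pair at each position already exceeds σ and the chart collapses to a single sequence, which is the degenerate case that reduces the method to plain self-training.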
In our preliminary experiments, we find that it is crucial to introduce temperature scaling to the emission scoring function in order to achieve good performance. Thus, for our main result, we define a new emission scoring function $\pi'$ as
$$\pi'(x, j, y_j) = \frac{\pi(x, j, y_j)}{\tau} \tag{5}$$
where $\tau$ is the temperature scale hyperparameter. We discuss and provide an analysis of this temperature scaling in Section 5.
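The effect of the temperature on the sharpness of the distribution can be seen in a small sketch. As a simplifying assumption, it applies a per-word softmax to the scaled emission scores and ignores the transition scores, so it only approximates the CRF marginals used by PTST; the function names and the example scores are hypothetical.

```python
import math


def scale_emissions(emissions, tau):
    """Apply temperature scaling: divide every emission score by tau."""
    return [[score / tau for score in row] for row in emissions]


def softmax(scores):
    """Normalise scores into a probability distribution (stable form)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical emission scores for one word over three tags.
row = [8.0, 2.0, 0.0]
sharp = softmax(scale_emissions([row], tau=1.0)[0])
flat = softmax(scale_emissions([row], tau=2.56)[0])
# The highest-scoring tag stays the same, but it holds less probability
# mass when tau > 1, so more alternatives can pass a cumulative threshold.
```

Dividing by τ > 1 preserves the ranking of the tags while moving probability mass from the top tag to the alternatives, which is what allows the chart to contain more than one tag sequence.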

Experimental Setup
We use the pre-trained source model and data provided for SemEval-2021 Task 10. We only use the model and data for the time expression recognition task (TIMEX hereinafter) as we only participate in that task. We use the practice data as the development set to tune the hyperparameters of our model with random search (the best learning rate and τ are 9 × 10^−6 and 2.56, respectively). We set the threshold σ = 0.95 following the setup of Kurniawan et al. (2021). We do not use any data sets other than those provided by the task organisers for TIMEX. We train PTST on the unlabelled test data for 5 epochs. As described in Section 3.1, we initialise the neural network for the emission scoring with the source RoBERTa model provided by the task organisers. We enforce the BIO constraints by initialising the transition matrix φ with −∞ for entries corresponding to illegal transitions, and zero otherwise; the constraints require that an inside tag always be preceded by an inside or beginning tag of the same entity. We use the linear-chain CRF implementation provided in the Torch-Struct library (Rush, 2020). To avoid out-of-memory errors, we discard sentences longer than 30 tokens. Additionally, we find that it is useful to freeze the embedding and the first few layers of the RoBERTa encoder. Thus, in our main result, we freeze the embedding layer and the first 6 (out of 12) layers of the encoder. An analysis is provided in Section 5.

Results and Discussion
Table 1 shows the TIMEX leaderboard on the official test data in the evaluation phase (see footnote 7). Our model PTST is ranked 7th out of 14 submissions in terms of F1 score, below the baseline model submission by the task organisers, which ranks 4th. This baseline model is the pre-trained source model fine-tuned on the labelled development data. Despite the relatively low F1 score, PTST achieves 90.1% precision, which is the highest among all submissions, and markedly above the second highest precision of 87.3%. Looking at recall, our model has the second lowest score of 71.3%, which is somewhat below the third lowest one of 73.2%. This result suggests that our model is sacrificing recall in favour of precision, which may be a desirable property for downstream tasks where making the right prediction is more critical.
Model overconfidence
As mentioned in Section 3.2, in our preliminary experiments, we find that the source model is extremely confident about its predictions, making the marginal probability distribution of tag pairs at any position j very sharp. This sharpness results in Ỹ(x) containing mostly just a single tag sequence, which is the predicted sequence from the source model, rendering the whole approach no different from simple self-training. To remedy this problem, we introduce temperature scaling in the emission score, which has been shown to be a simple but effective trick in model calibration (Geman and Geman, 1984; Guo et al., 2017). We define the new emission scoring function as shown in Eq. (5). Table 2 shows how the performance of PTST changes when τ is varied. We see that as τ increases, precision does too, but recall decreases, although at a relatively slower rate, so the F1 score tends to increase as well. Also reported in Table 2 are the median number of tag sequences and the fraction of gold tag sequences contained in the chart. The two quantities grow as τ does, which indicates that increasing τ indeed allows the chart to contain more tag sequences, thus increasing the coverage of correct tag sequences in the chart. However, when τ is too large (τ = 3.0), the model breaks down, presumably because Ỹ(x) contains too many noisy tag sequences to be useful.
Table 2: Model performance on the development data as τ changes. SRC is the pre-trained source model directly applied on the development data. F1, precision (P), and recall (R) scores are averages (± std) over 3 runs. n is the median number of high probability tag sequences in the chart Ỹ(x). p is the fraction of gold tag sequences contained in the chart.
Footnote 7: Also available on https://machine-learning-for-medical-language.github.io/source-free-domain-adaptation/leaderboard.
The decline in recall might be explained by the nature of the task, where in a single sentence most of the words are not time entities. When τ grows, the number of high probability tag sequences in Ỹ(x) does too. In the majority of these tag sequences, a word in x is likely to be tagged as a non-entity because time entities are naturally rare. Since tag sequences are treated uniformly (i.e. no tag sequence weighs more than the others), this provides a strong signal for the model that the word is a non-entity. Therefore, the model's capability of recognising entities is reduced. Conversely, a similar argument may explain the rise in precision. When the model predicts a word as an entity, it is likely that in the majority of tag sequences in Ỹ(x), the word is tagged as the same entity, providing a strong signal that the word is indeed that entity. In other words, if the model predicts an entity, the model is very confident about it. When confidence is high, it is more likely that the prediction is correct, thus resulting in higher precision.

Freezing layers
We also find that it is helpful to freeze the embedding layer and the first few layers of the RoBERTa model's encoder during training, presumably because they encode low-level linguistic information that is invariant across domains. Table 3 reports how the model performance changes with varying numbers of layers frozen (τ is fixed to 2.5). We observe that freezing the embedding and first several encoder layers gives a small boost to performance, with best performance reached with 6 frozen layers (the setting adopted in the model reported in the main results).
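The freezing scheme can be sketched as a rule over parameter names. The helper below is an illustration only, not our training code: `should_freeze` is a hypothetical function, and it assumes parameter names in the style used by HuggingFace RoBERTa models (for example `roberta.embeddings.word_embeddings.weight` and `roberta.encoder.layer.0.attention.self.query.weight`), with encoder layers numbered from 0.

```python
import re


def should_freeze(param_name, num_frozen_layers=6):
    """Return True if a parameter belongs to the embedding layer or to
    one of the first `num_frozen_layers` encoder layers.

    Assumes HuggingFace-style names, e.g. 'roberta.encoder.layer.3. ...'.
    """
    # Match 'embeddings' as a whole dotted component of the name.
    if ".embeddings." in f".{param_name}":
        return True
    match = re.search(r"encoder\.layer\.(\d+)\.", param_name)
    return match is not None and int(match.group(1)) < num_frozen_layers


# With a PyTorch model, the rule could be applied along these lines:
#     for name, param in model.named_parameters():
#         param.requires_grad = not should_freeze(name)
```

Under this rule, layers 0 to 5 and the embeddings stay fixed, while layers 6 to 11 and the classification head remain trainable.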
Error analysis
To better understand the errors of PTST, we present the confusion matrix of the model on the test data in Fig. 3. We see that the majority of the errors arise from the model not recognising actual time entities, consistent with the relatively low recall. The model has serious difficulties in recognising Season-Of-Year, for example, in fragments like:
(1) The increase in food aid beneficiaries is partly attributed to Meher harvest loss [...]
(2) The increase, which follows a seasonal trend, is seen in all regions except Tigray.
The model also seems to struggle with recognising Between and This.
Example sentence fragments where the model wrongly predicts a as Period, while the word is actually not a time entity.

Conclusions
We present PTST, our submission to the time expression recognition task of SemEval-2021 Task 10. We describe our sequence tagging model as a CRF over chain structures, parameterised by a neural network. Our domain adaptation approach leverages the output distribution of the source model to build a chart of high probability tag sequences for every sentence in the unlabelled target domain data.
PTST ranks 7th in terms of F1 score in the official leaderboard, but achieves the highest precision out of 14 submissions. We provide analyses on the importance of temperature scaling to mitigate model overconfidence and the patterns of errors.