Zero-shot Sequence Labeling for Transformer-based Sentence Classifiers

We investigate how sentence-level transformers can be modified into effective sequence labelers at the token level without any direct supervision. Existing approaches to zero-shot sequence labeling do not perform well when applied to transformer-based architectures. As transformers contain multiple layers of multi-head self-attention, information in the sentence gets distributed between many tokens, negatively affecting zero-shot token-level performance. We find that a soft attention module that explicitly encourages sharpness of attention weights can significantly outperform existing methods.


Introduction
Sequence labeling and sentence classification can represent facets of the same task at different granularities; for example, detecting grammar errors and predicting the grammaticality of sentences. Transformer-based architectures such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have been shown to achieve state-of-the-art results on both sequence labeling (Bell et al., 2019) and sentence classification (Sun et al., 2019) problems. However, such tasks are typically treated in isolation rather than within a unified approach.
In this paper, we investigate methods for inferring token-level predictions from transformer models trained only on sentence-level annotations. The ability to classify individual tokens without direct supervision opens possibilities for training sequence labeling models on tasks and datasets where only sentence-level or document-level annotation is available. In addition, attention-based architectures allow us to directly investigate what the model is learning and to quantitatively measure whether its rationales (supporting evidence) for particular input sentences match human expectations. While evaluating the faithfulness (Herman, 2017) of a model's rationale is still an open research question and up for debate (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; DeYoung et al., 2020; Jacovi and Goldberg, 2020; Atanasova et al., 2020), the methods explored here allow for measuring the plausibility (agreeability to human annotators; DeYoung et al. (2020)) of transformer-based models using existing sequence labeling datasets.
We evaluate and compare different methods for adapting pre-trained transformer models into zero-shot sequence labelers, trained using only gold sentence-level signal. Our experiments show that applying existing approaches to transformer architectures is not straightforward: transformers already contain several layers of multi-head attention, distributing sentence-level information across many tokens, whereas the existing methods rely on all the information passing through one central attention module. Approaches such as LIME (Ribeiro et al., 2016) for scoring word importance also struggle to infer correct token-level annotations in a zero-shot manner (e.g., LIME achieves only 2% F-score on one of our datasets). We find that a modified attention function is needed to allow transformers to better focus on individual important tokens and achieve a new state of the art on zero-shot sequence labeling.
The contributions of this paper are fourfold:
• We present the first experiments utilizing (pre-trained) sentence-level transformers as zero-shot sequence labelers;
• We perform a systematic comparison of alternative methods for zero-shot sequence labeling on different datasets;
• We propose a novel modification of the attention function that significantly improves the zero-shot sequence-labeling performance of transformers over the previous state of the art, while achieving on-par or better results on sentence classification;
• We make our source code and models publicly available to facilitate further research in the field. 1

Methods
We evaluate four different methods for turning sentence-level transformer models into zero-shot sequence labelers.
LIME

LIME (Ribeiro et al., 2016) generates local word-level importance scores through a meta-model that is trained on perturbed data generated by randomly masking out words in the input sentence. It was originally investigated in the context of Support Vector Machine (Hearst et al., 1998) text classifiers with unigram features. We apply LIME to a RoBERTa model supervised as a sentence classifier and investigate whether its scores can be used for sequence labeling. We use RoBERTa's MASK token to mask out individual words and allow LIME to generate 5000 masked samples per sentence. The resulting explanation weights are then used as classification scores for each word, with the decision threshold fine-tuned based on development set performance. Thorne et al. (2019) found LIME to outperform attention-based approaches on the task of explaining NLI models. They used LIME to probe an LSTM-based sentence-pair classifier (Lan and Xu, 2018) by removing tokens from the premise and hypothesis sentences separately. The generated scores were used to perform binary classification of tokens, with the threshold based on F1 performance on the development set. The token-level predictions were evaluated against human explanations of the entailment relation using the e-SNLI dataset (Camburu et al., 2018). LIME was found to outperform the other methods; however, it was also 1000× slower than attention-based methods at generating these explanations.
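The perturb-and-fit idea behind LIME can be sketched without the lime library: mask random word subsets, score each perturbed sentence with the classifier, and fit a proximity-weighted linear meta-model whose coefficients act as word importances. This is a minimal illustration of the technique, not the setup used in our experiments (which uses RoBERTa's MASK token and 5000 samples per sentence); `predict_fn`, the mask token, and the similarity kernel here are simplified placeholders.

```python
import numpy as np

def lime_word_scores(sentence, predict_fn, num_samples=2000, mask_token="<mask>", seed=0):
    """LIME-style word importances: randomly mask words, score each
    perturbed sentence with the classifier, then fit a proximity-weighted
    ridge regression whose coefficients are the importance scores."""
    rng = np.random.default_rng(seed)
    words = sentence.split()
    n = len(words)
    # Binary interpretable features: 1 = word kept, 0 = word masked out.
    masks = rng.integers(0, 2, size=(num_samples, n))
    masks[0] = 1  # always include the unperturbed sentence
    texts = [" ".join(w if keep else mask_token for w, keep in zip(words, m))
             for m in masks]
    scores = np.array([predict_fn(t) for t in texts])
    # Proximity kernel: perturbations that keep more words count more.
    weights = masks.mean(axis=1)
    X = masks.astype(float)
    XtW = X.T * weights          # X^T W, shape (n, num_samples)
    reg = 1e-3 * np.eye(n)       # small ridge term for numerical stability
    coeffs = np.linalg.solve(XtW @ X + reg, XtW @ scores)
    return list(zip(words, coeffs))
```

With a toy classifier that fires only when a specific word survives masking, the meta-model assigns that word the dominant coefficient, mirroring how LIME scores are used as token-level classification scores.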

Attention heads
The attention heads in a trained transformer model are designed to identify and combine useful information for a particular task. Clark et al. (2019) found that specific heads can specialize in different linguistic properties such as syntax and coreference. However, transformer models contain many layers with multiple attention heads, distributing the text representation and making it more difficult to identify token importance for the overall task.
Given a particular head, we can obtain an importance score for each token by averaging the attention scores from all the tokens that attend to it. In order to investigate the best possible setting, we report results for the attention head that achieves the highest token-level Mean Average Precision score on the development set.
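This scoring can be sketched as follows, assuming attention tensors of shape (layers, heads, N, N) and treating the development metric (e.g. Mean Average Precision) as a callback; the helper names are ours:

```python
import numpy as np

def head_importance(attn, layer, head):
    """attn has shape (layers, heads, N, N); attn[l, h, i, j] is how much
    token i attends to token j. A token's importance score is the average
    attention it receives from all tokens."""
    return attn[layer, head].mean(axis=0)

def select_best_head(attn_per_sentence, dev_metric):
    """Pick the (layer, head) whose importance scores maximize a
    token-level development metric (e.g. Mean Average Precision)."""
    n_layers, n_heads = attn_per_sentence[0].shape[:2]
    best, best_score = None, float("-inf")
    for l in range(n_layers):
        for h in range(n_heads):
            scores = [head_importance(a, l, h) for a in attn_per_sentence]
            metric = dev_metric(scores)
            if metric > best_score:
                best, best_score = (l, h), metric
    return best
```

Because each row of an attention matrix sums to 1, the importance scores of a head also sum to 1 over the sentence, which forces the weights to be spread across tokens.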

Soft attention
Rei and Søgaard (2018) described a method for predicting token-level labels based on a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) architecture supervised at the sentence-level only. A dedicated attention module was integrated for building sentence representations, with its attention weights also acting as token-level importance scores. The architecture was found to outperform a gradient-based approach on the tasks of zero-shot sequence labeling for error detection, uncertainty detection, and sentiment analysis.
In order to obtain a single raw attention value $e_i$ for each token, the biLSTM output vectors were passed through a feedforward layer:

$$\tilde{e}_i = \tanh(W_e h_i + b_e), \qquad e_i = \tilde{W}_e \tilde{e}_i + \tilde{b}_e$$

where $\tilde{e}_i$ is the attention vector for token $t_i$; $h_i$ is the biLSTM output for $t_i$; and $e_i$ is the single raw attention value. $W_e$, $b_e$, $\tilde{W}_e$, $\tilde{b}_e$ are trainable parameters.
Instead of softmax or sparsemax (Martins and Astudillo, 2016), which would restrict the distribution of the scores, a soft attention based on the sigmoid activation was used to obtain importance scores:

$$\tilde{a}_i = \sigma(e_i), \qquad a_i = \frac{\tilde{a}_i}{\sum_{k=1}^{N} \tilde{a}_k}$$

where $N$ is the number of tokens and $\sigma$ is the logistic function. $\tilde{a}_i$ shows the importance of a particular token and is in the range $0 \le \tilde{a}_i \le 1$, independent of any other scores in the sentence; therefore, it can be directly used for sequence labeling with a natural threshold of 0.5. $a_i$ contains the same information but is normalized to sum up to 1 over the whole sentence, making it suitable for attention weights when building the sentence representation.
As $a_i$ and $\tilde{a}_i$ are directly tied, training the former through the sentence classification objective will also train the latter for the sequence labeling task.
The attention values were then used to obtain the sentence representation $c$, acting as weights for the biLSTM token outputs:

$$c = \sum_{i=1}^{N} a_i h_i$$

Finally, the sentence representation $c$ was passed through a feedforward layer, followed by a sigmoid, to obtain the predicted score $y$ for the sentence:

$$d = \tanh(W_d c + b_d), \qquad y = \sigma(W_y d + b_y)$$

where $d$ is the sentence vector, $c$ is the sentence representation, and $y$ is the sentence prediction score. $W_d$, $b_d$, $W_y$, $b_y$ are trainable parameters.
We adapt this approach to the transformer models by attaching a separate soft attention module on top of the token-level output representations. This effectively ignores the CLS token, which is commonly used for sentence classification, and instead builds a new sentence representation from the token representations, which replace the previously used biLSTM outputs:

$$e_i = \tilde{W}_e \tanh(W_e T_i + b_e) + \tilde{b}_e$$

where $T_i$ is the contextualized embedding for token $t_i$. A diagram of the model architecture is included in Appendix F.

Commonly used tokenizers for transformer models split words into subwords, while sequence labeling datasets are annotated at the word level. We find that taking the maximum attention value over all the subwords as the word-level importance score produces good results on the development sets. For a word $w_i$ split into tokens $[t_j, \ldots, t_m]$, where $j, m \in [1, N]$, the resulting final word importance score $r_i$ is then given by:

$$r_i = \max(\tilde{a}_j, \ldots, \tilde{a}_m) \tag{6}$$

During training, we optimize sentence-level binary cross-entropy as the main objective function:

$$L_1 = -\sum_{j} \Big[ \tilde{y}^{(j)} \log y^{(j)} + \big(1 - \tilde{y}^{(j)}\big) \log\big(1 - y^{(j)}\big) \Big]$$

where $y^{(j)}$ and $\tilde{y}^{(j)}$ are the predicted sentence classification score and the gold label for the $j$-th sentence, respectively. We also adopt the additional loss functions from Rei and Søgaard (2018), which encourage the attention weights to behave more like token-level classifiers:

$$L_2 = \sum_{j} \Big( \min_i \tilde{a}_i^{(j)} \Big)^2 \tag{8} \qquad\qquad L_3 = \sum_{j} \Big( \max_i \tilde{a}_i^{(j)} - \tilde{y}^{(j)} \Big)^2 \tag{9}$$

Eq. 8 optimizes the minimum unnormalized attention to be 0 and therefore incentivizes the model to focus on only some, but not all, words; Eq. 9 ensures that some attention weights are close to 1 if the overall sentence is classified as positive.
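The soft attention forward pass, the subword-to-word max-pooling, and the auxiliary losses described above can be sketched in numpy as a simplified, unbatched illustration; the parameter names (`W_e`, `b_e`, `w_e`, `b_out`) and the word-span format are ours, not those of a released implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_attention_forward(T, W_e, b_e, w_e, b_out):
    """T: (N, D) contextualized token embeddings. Returns the unnormalized
    scores a_tilde (thresholded at 0.5 for labeling) and the
    attention-weighted sentence representation c."""
    e = np.tanh(T @ W_e + b_e) @ w_e + b_out   # raw attention value per token
    a_tilde = sigmoid(e)                        # independent scores in [0, 1]
    a = a_tilde / a_tilde.sum()                 # normalized attention weights
    c = a @ T                                   # sentence representation
    return a_tilde, c

def word_scores_from_subwords(a_tilde, word_spans):
    """Eq. 6: max-pool subword scores onto words; word_spans holds the
    inclusive (j, m) subword index range of each word."""
    return [float(a_tilde[j:m + 1].max()) for j, m in word_spans]

def auxiliary_losses(a_tilde, y_gold):
    """Eq. 8 pushes the smallest unnormalized score towards 0; Eq. 9
    pushes the largest towards the gold sentence label."""
    return a_tilde.min() ** 2, (a_tilde.max() - y_gold) ** 2
```

Note that only the normalized weights enter the sentence representation, while the unnormalized scores are what the auxiliary losses shape into token-level classifiers.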

Weighted soft attention
Our experiments show that, when combined with transformer-based models, the soft attention method tends to spread the attention too widely. Instead of focusing on specific important words, the model broadly attends to the whole sentence. Figures 3 and 4 in Appendix A present examples of this behaviour. As transformers contain several layers of attention, with multiple heads in each layer, the information in the sentence gets distributed across all tokens before it reaches the soft attention module at the top. To improve this behaviour and incentivize the model to direct information through a smaller and more focused set of tokens, we experiment with a weighted soft attention:

$$a_i = \frac{\tilde{a}_i^{\beta}}{\sum_{k=1}^{N} \tilde{a}_k^{\beta}}$$

where $\beta$ is a hyperparameter; values $\beta > 1$ make the weight distribution sharper, allowing the model to focus on a smaller number of tokens. We experiment with values of $\beta \in \{1, 2, 3, 4\}$ on the development sets and find $\beta = 2$ to significantly improve token labeling performance without negatively affecting sentence classification results.
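A minimal sketch of the weighted normalization; with the same unnormalized scores, β = 2 produces a visibly sharper distribution than the plain normalization (β = 1):

```python
import numpy as np

def weighted_attention(a_tilde, beta=2.0):
    """a_i = a_tilde_i**beta / sum_k a_tilde_k**beta. With beta > 1 the
    distribution sharpens, concentrating weight on high-scoring tokens."""
    p = a_tilde ** beta
    return p / p.sum()
```

Raising the scores to a power before normalizing keeps the weights a valid distribution while amplifying the gap between confident and unconfident tokens, which is exactly the behaviour the sharpness modification targets.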

Datasets
We investigate the performance of these methods as zero-shot sequence labelers using three different datasets. Gold token-level annotation in these datasets is used for evaluation; however, the models are trained using sentence-level labels only. The CoNLL 2010 shared task (Farkas et al., 2010) 2 focuses on the detection of uncertainty cues in natural language text. The dataset contains 19,542 examples with both sentence-level uncertainty labels and annotated keywords indicating uncertainty. We use the train/test data from the task and randomly choose 10% of the training set for development.
We also evaluate on the task of grammatical error detection (GED) -identifying which sentences are grammatically incorrect (i.e., contain at least one grammatical error). The First Certificate in English dataset FCE (Yannakoudakis et al., 2011) consists of essays written by non-native learners of English, annotated for grammatical errors. We use the train/dev/test splits released by Rei and Yannakoudakis (2016) for sequence labeling, with a total of 33,673 sentences.
In addition, we evaluate on the Write & Improve (Yannakoudakis et al., 2018) and LOCNESS (Granger, 1998) GED dataset 3 (38,692 sentences) released as part of the BEA 2019 shared task (Bryant et al., 2019). It contains English essays written in response to varied topics by English learners of different proficiency levels, as well as by native English speakers. As the gold test set labels are not publicly available, we evaluate on the released development set and use 10% of the training data for tuning 4 . For both GED datasets, we train the model to detect grammatically incorrect sentences and evaluate how well the methods can identify individual tokens that have been annotated as errors.

Experimental setup
We use the pre-trained RoBERTa-base (Liu et al., 2019) model, made available by HuggingFace (Wolf et al., 2020), as our transformer architecture. Following Mosbach et al. (2021), transformer models are fine-tuned for 20 epochs, and the best-performing checkpoint is then chosen based on sentence-level performance on the development set. Each experiment is repeated with 5 different random seeds and the averaged results are reported. The average duration of training on an Nvidia GeForce RTX 2080Ti was 1 hour. Significance testing is performed with a two-tailed paired t-test and α = 0.05. Hyperparameters are tuned on the development set and presented in Appendices B and C.
The LIME and attention head methods provide only a score without a natural decision boundary for classification. Therefore, we choose their thresholds based on the token-level F1-score on the development set. In contrast, the soft attention and weighted soft attention methods do not require such additional tuning that uses token-level labels.
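The threshold selection for the LIME and attention-head scores amounts to a simple sweep over candidate thresholds on the development set; the following is an illustrative helper, not our exact tuning code:

```python
def tune_threshold(scores, labels):
    """Sweep candidate thresholds over dev-set scores and return the one
    maximizing token-level F1 on the positive class."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The key point is that this sweep consumes gold token-level labels, which is precisely why methods that need it are not truly zero-shot, unlike the 0.5 threshold available to the sigmoid-based soft attention scores.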

Results
The results are presented in Table 1. Each model is trained as a sentence classifier and then evaluated as a token labeler. The challenge of the zero-shot sequence-labeling setting lies in the fact that the models are trained without utilizing any gold token-level signal; nevertheless, some methods perform considerably better than others. For reference, we also include a random baseline, which samples token-level scores from the standard uniform distribution; a RoBERTa model supervised as a sentence classifier only; and the biLSTM-based model from Rei and Søgaard (2018).
We report the F1-measure at the token level along with Mean Average Precision (MAP) for returning positive tokens. The MAP metric views the task as a ranking problem and therefore removes the dependence on specific classification thresholds. In addition, we report the F1-measure on the main sentence-level task to ensure the proposed methods do not have adverse effects on sentence classification performance. Precision and recall values are included in Appendix E.

Figure 1: Example word-level importance scores $r_i$ (Eq. 6) of different methods applied to an excerpt from the CoNLL 2010 dataset. HEAD corresponds to attention heads; SA to soft attention; and W-SA to weighted soft attention. We can observe that W-SA is the only method that correctly assigns substantially higher weights to the 'may' and 'seems' uncertainty cues.

LIME has relatively low performance on FCE and BEA 2019, while it achieves somewhat higher results on CoNLL 2010. Comparing the MAP scores, the attention head method performs substantially better, especially considering that it is much more lightweight and requires no additional computation. Nevertheless, both of these methods rely on using some annotated examples to tune their classification threshold, which precludes their application in a truly zero-shot setting.
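The token-level MAP used here is the mean, over sentences, of the average precision of the score-induced ranking of gold-positive tokens; a minimal sketch of that computation:

```python
def average_precision(scores, labels):
    """Rank tokens by score; average the precision at each rank where a
    gold-positive token occurs."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_sentence):
    """MAP over a list of (scores, labels) pairs, one per sentence."""
    return sum(average_precision(s, l) for s, l in per_sentence) / len(per_sentence)
```

Because only the relative ordering of scores matters, MAP compares methods without committing to any particular decision threshold.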
Combining the soft attention mechanism with the transformer architecture provides some improvements over the previous methods, while also improving over Rei and Søgaard (2018). A notable exception is the CoNLL 2010 dataset, where this method achieves only 8% F1 and 20% MAP. Error analysis revealed that this is due to the transformer representations spreading attention scores evenly between a large number of tokens, as observed in Figure 1. Uncertainty cues in CoNLL 2010 can span whole sentences (e.g., 'Either ... or ...'), with such examples encouraging the model to distribute information even further.
The weighted soft attention modification addresses this issue and considerably improves performance across all metrics on all datasets. Compared to the non-weighted version of the soft attention method, applying the extra weights leads to a significant improvement in terms of MAP, with a minimum of 5.01% absolute gain on FCE. The improvements are also statistically significant compared to the current state of the art (Rei and Søgaard, 2018): 5.35% absolute improvement on FCE; 9.38% on BEA 2019; and 3.36% on CoNLL 2010. While the F1 on CoNLL 2010 is slightly lower, the MAP score is higher, indicating that the model has difficulty finding an optimal decision boundary but nevertheless provides a better ranking. In future work, the weighted soft attention method for transformers could potentially be combined with token supervision in order to train robust multi-level models (Barrett et al., 2018; Rei and Søgaard, 2019).

Conclusion
We investigated methods for inferring token-level predictions from transformer models trained only on sentence-level annotations. Experiments showed that previous approaches designed for LSTM architectures do not perform as well when applied to transformers. As transformer models already contain multiple layers of multi-head attention, the input representations get distributed between many tokens, making it more difficult to identify the importance of each individual token. LIME was not able to accurately identify target tokens, while the soft attention method primarily assigned equal attention scores across most words in a sentence. Directly using the scores from the existing attention heads performed better than expected, but required some annotated data for tuning the decision threshold. Modifying the soft attention module with an explicit sharpness constraint on the weights was found to encourage more distinct predictions, significantly improving token-level results.

A Example word-level predictions
We present samples of word-level predictions (word-level importance scores $r_i$, Eq. 6) to illustrate differences between methods. In the figures that follow, HEAD refers to attention heads, SA to soft attention, and W-SA to weighted soft attention.

Figure 2: CoNLL 2010 negative sentence (without uncertainty cues). We can clearly see that most methods correctly put weights close to 0 for all words, except for HEAD, which focuses on 'shown' and '.'. We surmise this is due to the fact that, for HEAD, the weights over the whole sentence have to sum up to 1.

Figure 3: We can observe that HEAD correctly identifies both of the uncertainty cues: 'may' and 'seems'; however, the weight for 'may' is quite low. Similarly, LIME identifies both tokens, but the weight for 'seems' is particularly low (lower than for 'to'). SA simply assigns high weights to all words. W-SA focuses primarily on the two uncertainty cue words; however, it also incorrectly focuses on 'not'.

Figure 4: We can see that both LIME and HEAD struggle to assign informative and/or useful weights to the words. All SA weights are relatively high, with small variations in value. We can see that squaring (W-SA) leads to more well-defined weights over the whole sentence, with high weights mainly observed in the second part of the sentence, which is the one that contains incorrect words. However, on this dataset, even W-SA struggles to correctly identify which words precisely are incorrect.

Table 10: Token-level results: P, R and F1 refer to Precision, Recall and F-measure respectively on the positive class. MAP is the Mean Average Precision at the token level.

B Hyperparameters
F Weighted soft attention architecture

$[e_1, e_2, \ldots, e_n]$ are attention vectors, and $[a_1, a_2, \ldots, a_n]$ are normalized attention weights. $d$ represents the output vector and $y$ the final output logits.