Span Fine-tuning for Pre-trained Language Models

Pre-trained language models (PrLMs) have to carefully manage input units when training on very large texts with vocabularies consisting of millions of words. Previous works have shown that incorporating span-level information over consecutive words during pre-training can further improve the performance of PrLMs. However, given that span-level clues are introduced and fixed in pre-training, these methods are time-consuming and lack flexibility. To alleviate this inconvenience, this paper presents a novel span fine-tuning method for PrLMs, which allows the span setting to be adaptively determined by each specific downstream task during the fine-tuning phase. In detail, any sentence processed by the PrLM is segmented into multiple spans according to a pre-sampled dictionary. The segmentation information is then sent through a hierarchical CNN module together with the representation outputs of the PrLM to ultimately generate a span-enhanced representation. Experiments on the GLUE benchmark show that the proposed span fine-tuning method significantly enhances PrLMs and, at the same time, offers more flexibility in an efficient way.


Introduction
Pre-trained language models (PrLMs), including ELECTRA (Clark et al., 2020), RoBERTa (Liu et al., 2019b), and BERT (Devlin et al., 2018), have demonstrated strong performance in downstream tasks (Wang et al., 2018). Leveraging self-supervised training on large text corpora, these models are able to provide contextualized representations in an efficient way. For instance, BERT uses Masked Language Modeling and Next Sentence Prediction as pre-training objectives and is trained on a corpus of 3.3 billion words.
In order to be adaptive to a wider range of applications, PrLMs usually generate sub-token-level representations (words or subwords) as basic linguistic units. For downstream tasks such as natural language understanding (NLU), span-level representations, e.g. phrases and named entities, are also important. Previous works manifest that by changing pre-training objectives, PrLMs' ability to capture span-level information can be strengthened to some extent. For example, based on BERT, SpanBERT (Joshi et al., 2019) focuses on masking and predicting text spans, instead of sub-token-level information, for pre-training. Entity-level masking is used as a pre-training strategy by ERNIE models (Sun et al., 2019; Zhang et al., 2019a). The aforementioned methods prove that introducing span-level information in pre-training is effective for different NLU tasks.
However, the span-level information required by various NLU tasks differs a lot from case to case. The methods of introducing span-level information in the pre-training phase proposed by previous works do not fit all of these requirements and cannot improve performance on every NLU task. For instance, ERNIE models (Sun et al., 2019) perform remarkably well in relation classification, while underperforming BERT in language inference tasks such as MNLI (Nangia et al., 2017). Therefore, it is imperative to develop a strategy that incorporates span-level information into PrLMs in a more flexible and universally adaptive way. This paper proposes a novel approach, Span Fine-tuning (SF), to leverage span-level information in the fine-tuning phase and thereby formulate a task-specific strategy. Compared to existing works, our approach requires less time and fewer computing resources, and is more adaptive to various NLU tasks.
In order to maximize the value and contribution of span-level information, in addition to the sub-token-level representations generated by BERT, Span Fine-tuning also applies a computationally motivated segmentation. Although various techniques, such as dependency parsing (Zhou et al., 2019) and semantic role labeling (SRL) (Zhang et al., 2019b), have been used as auxiliary tools for sentence segmentation, these methods demand extra parsing procedures, which increase complexity in actual practice. Span Fine-tuning first leverages a pre-sampled n-gram dictionary to segment input sentences into spans. Then, the sub-token-level representations within the same span are combined to generate a span-level representation. Finally, the span-level representations are merged with the sub-token-level representations into a sentence-level representation. In this way, the sentence-level representation contains and maximizes the utilization of both sub-token-level and span-level information.
Experiments are conducted on the GLUE benchmark (Wang et al., 2018), which includes many NLU tasks, such as text classification, semantic similarity, and natural language inference. Empirical results demonstrate that Span Fine-tuning is able to further improve the performance of different PrLMs, including BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019b) and SpanBERT (Joshi et al., 2019). The results of the experiments with SpanBERT indicate that Span Fine-tuning leverages span-level information differently from PrLMs pre-trained with span-level information, which shows the distinctiveness of our approach. Ablation studies and analysis also verify that Span Fine-tuning is essential for further performance improvement of PrLMs.

Pre-trained language models
Learning reliable and broadly applicable word representations has been an ongoing focus of the natural language processing community. Language modeling objectives have proved effective for generating distributed representations (Mnih and Hinton, 2009). By generating deep contextualized word representations, ELMo (Peters et al., 2018) advances the state of the art for several NLU tasks. Leveraging the Transformer (Vaswani et al., 2017), BERT (Devlin et al., 2018) further advances the field of transfer learning. Recent PrLMs are established based on various extensions of BERT, including using a GAN-style architecture (Clark et al., 2020), applying a parameter-sharing strategy (Lan et al., 2019), and optimizing the pre-training procedure (Liu et al., 2019b).

Span-level pre-training methods
Previous works manifest that introducing span-level information in the pre-training phase can improve PrLMs' performance. Originally, BERT leverages the prediction of single masked tokens as one of its pre-training objectives. Due to the use of WordPiece embeddings (Wu et al., 2016), BERT segments sentences into subword-level tokens, so the masked tokens are at sub-token level, e.g. "##ing". Devlin et al. (2018) show that masking whole words, rather than only single tokens, can further enhance the performance of BERT. Later, it was shown by (Sun et al., 2019; Zhang et al., 2019a) that masking entities is also helpful for PrLMs. By randomly masking adjoining spans in pre-training, SpanBERT (Joshi et al., 2019) can generate better representations for given texts. AMBERT (Zhang and Li, 2020) achieves better performance than its precursors on NLU tasks by incorporating both sub-token-level and span-level tokenization in pre-training. The aforementioned studies all focus on introducing span-level information in pre-training. To the best of our knowledge, introducing span-level information in fine-tuning is still an unexplored space, which makes our approach a valuable attempt.

Integration of fine-grained representation
Different formats of downstream tasks require sentence-level representations, such as natural language inference (Bowman et al., 2015; Nangia et al., 2017), semantic textual similarity (Cer et al., 2017) and sentiment classification (Socher et al., 2013). Besides directly pre-training representations of coarser granularity (Le and Mikolov, 2014; Logeswaran and Lee, 2018), many methods have been explored to obtain a task-specific sentence-level representation by integrating fine-grained token-level representations (Conneau et al., 2017). Kim (2014) shows that by applying a convolutional neural network (CNN) on top of pre-trained word vectors, we can get a sentence-level representation that is well adapted to classification. Inspired by these methods, after a series of preliminary attempts, we choose a hierarchical CNN architecture to recombine fine-grained representations into coarse-grained ones.

Span Fine-tuning

Figure 1 demonstrates the overall framework of Span Fine-tuning, which essentially uses BERT as a foundation and incorporates segmentation as an auxiliary tool. The figure does not exhaustively depict the details of BERT, given that the model is relatively popular and ubiquitous; further information on BERT is available in (Devlin et al., 2018). In Span Fine-tuning, an input sentence is divided into sub-word-level tokens and then sent to BERT to generate sub-token-level representations. At the same time, the input is segmented into spans based on n-gram statistics. By combining the segmentation information with the sub-token-level representations generated by BERT, we divide the representation into several spans. Then, the spans are sent through a hierarchical CNN module to obtain a span-level information enhanced representation.
Finally, the sub-token-level representation of the [CLS] token generated by BERT and the span-level information enhanced representation are concatenated to form a final representation, which maximizes the value of both sub-token-level and span-level information for NLU tasks.

Sentence Segmentation
Semantic role labeling (SRL) (Zhang et al., 2019b) and dependency parsing (Zhou et al., 2019) have been used as auxiliary tools for segmentation in previous works. Nonetheless, these techniques demand additional parsing procedures and therefore increase complexity in real applications. In order to obtain a simpler and more convenient segmentation, we select, based on frequency, meaningful n-grams appearing in the wikitext-103 dataset* to form a pre-sampled dictionary.
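The dictionary construction can be sketched as follows. This is a minimal illustration, not our exact preprocessing script; the function name is ours, and restricting to n ≥ 2 is an assumption (single tokens need no dictionary entry), while the n ≤ 5 range and frequency threshold follow the setup described later.

```python
from collections import Counter

def build_ngram_dict(sentences, max_n=5, min_count=10):
    """Count all n-grams (2 <= n <= max_n) over tokenized sentences and keep
    those occurring more than min_count times, as space-joined strings."""
    counts = Counter()
    for toks in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[" ".join(toks[i:i + n])] += 1
    return {gram for gram, c in counts.items() if c > min_count}
```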
We use the dictionary to match n-grams from the head of each input sentence. N-grams with greater lengths are prioritized, while unmatched tokens remain unchanged. In this way, we are able to obtain a specific segmentation of the input sentence. Figure 2 demonstrates some examples of sentence segmentation from the GLUE dataset.

[Figure 2: Segmentation Examples. Sample sentence: "the model is intended to give managers an overview of the acquisition process and to help them decrease acquisition risk".]
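The greedy longest-match-first segmentation can be sketched as follows. This is a minimal illustration assuming the dictionary stores space-joined n-grams; the function name is ours, not from a released implementation.

```python
def segment(tokens, dictionary, max_n=5):
    """Greedily match dictionary n-grams from the head of the sentence,
    trying longer n-grams first; unmatched tokens become singleton spans."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            if " ".join(tokens[i:i + n]) in dictionary:
                spans.append(tokens[i:i + n])
                i += n
                break
        else:  # no n-gram (n >= 2) matched at position i
            spans.append([tokens[i]])
            i += 1
    return spans
```

For example, with a dictionary containing "an overview of" and "the process", the sentence "give managers an overview of the process" is split into the spans [give], [managers], [an overview of], [the process].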

Sentence Encoder Architecture
An input sentence X = {x_1, . . . , x_n} is given with length n. The sentence is first divided into sub-word tokens (with a special token [CLS] at the beginning) and converted to sub-token-level representations E = {e_1, . . . , e_m} (usually m is larger than n) according to the embeddings proposed by (Wu et al., 2016). Then, the transformer encoder (BERT) captures the contextual information of each token by self-attention and generates a sequence of sub-token-level contextual embeddings T = {t_1, . . . , t_m}, in which t_1 is the contextual representation of the special token [CLS]. Based on the segmentation generated by the n-gram statistics, the sub-token-level contextual representations are combined into several spans {C_1, . . . , C_r}, with r as a hyperparameter indicating the maximum number of spans over all processed sentences. Each C_i contains several contextual sub-token-level representations extracted from T, denoted as {t_i1, t_i2, ..., t_il}, where l is another hyperparameter representing the maximum number of tokens over all spans. A CNN-Maxpooling module is applied to each C_i to get a span-level representation c_i:

c_i = MaxPooling(ReLU(W_1 [t_ij; ...; t_i(j+k-1)] + b_1)), j = 1, ..., l - k + 1,

where W_1 and b_1 are trainable parameters and k is the kernel size. Based on the span-level representations {c_1, . . . , c_r}, another CNN-Maxpooling module with trainable parameters W_2 and b_2 is applied to obtain a sentence-level representation s with enhanced span-level information:

s = MaxPooling(ReLU(W_2 [c_j; ...; c_(j+k-1)] + b_2)), j = 1, ..., r - k + 1.

Finally, we concatenate s with the contextual sub-token-level representation t_1 of the special token [CLS] provided by BERT, and generate a sentence-level representation s* that maximizes the value of both sub-token-level and span-level information for NLU tasks: s* = [s; t_1].
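The hierarchical CNN-Maxpooling encoder described above can be sketched in PyTorch. This is a simplified illustration under assumptions (ReLU activation, same-padding convolutions, spans already grouped into a padded tensor); it is not the authors' released code.

```python
import torch
import torch.nn as nn

class HierarchicalCNN(nn.Module):
    """Sketch of the span encoder: CNN + max-pooling over the tokens of each
    span yields span representations c_i; a second CNN + max-pooling over
    the span vectors yields the sentence-level representation s."""

    def __init__(self, hidden, kernel=3):
        super().__init__()
        # Conv1d expects (batch, channels, length); same padding keeps length
        self.token_cnn = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.span_cnn = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)

    def forward(self, spans):
        # spans: (batch, r, l, hidden) -- r spans per sentence, l tokens per span
        b, r, l, h = spans.shape
        x = spans.reshape(b * r, l, h).transpose(1, 2)        # (b*r, h, l)
        c = torch.relu(self.token_cnn(x)).max(dim=2).values   # c_i: (b*r, h)
        c = c.reshape(b, r, h).transpose(1, 2)                # (b, h, r)
        s = torch.relu(self.span_cnn(c)).max(dim=2).values    # s: (b, h)
        return s
```

The final representation would then be obtained as `torch.cat([s, t1], dim=-1)`, concatenating s with the [CLS] vector t_1.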

Tasks and Datasets
To evaluate Span Fine-tuning, experiments are conducted on nine NLU benchmark datasets, covering text classification, natural language inference, and semantic similarity. Eight of them are available from the GLUE benchmark (Wang et al., 2018); the remaining one is SNLI (Bowman et al., 2015), a widely accepted natural language inference dataset.

Pre-trained Language Model
We leverage the PyTorch implementations of BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019b) and SpanBERT (Joshi et al., 2019) provided by HuggingFace†.

Set Up
We select all n-grams with n ≤ 5 that occur more than ten times in the wikitext-103 dataset to form a dictionary. The pre-sampled dictionary, containing more than 400 thousand n-grams, is used to segment input sentences. During segmentation, two hyperparameters are involved: r, representing the largest number of spans in a sentence, and l, indicating the largest number of tokens included in a span. In order to maintain uniform feature dimensions for each input sentence, padding and truncation are employed. We set r to 16 and, depending on the NLU task, choose l in {64, 128}.
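The padding and truncation to the fixed dimensions r and l can be sketched as follows. This is an illustrative helper; the `[PAD]` symbol and the function name are our assumptions.

```python
def pad_spans(spans, r=16, l=8, pad="[PAD]"):
    """Truncate to at most r spans of at most l tokens each, padding both
    dimensions so every sentence yields a uniform (r, l) grid that the
    hierarchical CNN can batch. r=16 follows the paper's setting."""
    grid = []
    for span in spans[:r]:
        span = list(span)[:l]
        grid.append(span + [pad] * (l - len(span)))
    while len(grid) < r:       # pad with empty spans up to r
        grid.append([pad] * l)
    return grid
```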
The fine-tuning procedure is the same as BERT's. Adam is used as the optimizer. The initial learning rate is in {1e-5, 2e-5, 3e-5}, the warm-up rate is 0.1, and the L2 weight decay is 0.01. The batch size is set in {16, 32, 48}. The maximum number of epochs is set in {2, 3, 4, 5} depending on the NLU task. Input sentences are divided into sub-tokens and converted to WordPiece embeddings, with a maximum length in {128, 256}. The output size of the CNN layers is the same as the hidden size of the PrLM, and the kernel size is set to 3.

† https://github.com/huggingface

[Table 1 note: the problematic WNLI set is excluded; for datasets with F1 scores, accuracy is not shown; mc and pc denote the Matthews correlation and Pearson correlation respectively.]

Results with BERT as PrLM
Two released BERT models (Devlin et al., 2018), BERT Large Whole Word Masking and BERT Base, are first used as pre-trained encoders and baselines for Span Fine-tuning. Compared with BERT Large, BERT Large Whole Word Masking reaches better performance, since it uses whole-word masking in the pre-training phase. We therefore select BERT Large Whole Word Masking as a stronger baseline. The results indicate that Span Fine-tuning can maximize the contribution of span-level information even when compared to this stronger baseline. Table 1 exhibits the results on the GLUE datasets, showing that Span Fine-tuning significantly improves the performance of PrLMs. Since our approach leverages BERT as a foundation and undergoes the same evaluation procedure, it is evident that the performance gain is fully contributed by the newly introduced Span Fine-tuning.
In order to test the statistical significance of the results, we use McNemar's test, which is designed for paired nominal observations and is appropriate for binary classification tasks. The p-value is defined as the probability of obtaining a result equal to or more extreme than what was observed under the null hypothesis; the smaller the p-value, the higher the significance. A commonly used level of reliability is 95%, i.e. p < 0.05. As shown in Table 2, compared with the baseline, our method passes the significance test on all the binary classification tasks of the GLUE benchmark. Span Fine-tuning reaches the same performance improvement as previous methods: as illustrated in Table 1, on average, SpanBERT improves the result by one percentage point over its baseline (BERT-1seq), while Span Fine-tuning achieves an improvement of 1.1 percentage points over our baseline. However, as shown in Table 3, Span Fine-tuning requires considerably less time and fewer computing resources than large-scale pre-training for span-level information incorporation. When Span Fine-tuning is adopted, the extra parameters are only 3 percent of the total parameters of the adopted PrLMs for every downstream task, and introduce little extra overhead.
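The exact (binomial) form of McNemar's test can be computed directly from the two discordant counts. This is a generic sketch of the standard test, not tied to any particular implementation used in the paper.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test. b and c are the discordant counts
    (baseline correct / ours wrong, and vice versa). Under the null
    hypothesis the smaller count follows Binomial(b + c, 0.5); a small
    p-value indicates the two systems differ significantly."""
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0
    p = 2.0 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)  # cap the doubled tail probability at 1
```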
Besides, Span Fine-tuning is more flexible and adaptive compared to previous methods. Table 1 shows that Span Fine-tuning achieves stronger results on all NLU tasks compared to the baseline, whereas the results of SpanBERT on certain tasks, such as Quora Question Pairs and the Microsoft Research Paraphrase Corpus, are worse than its baseline‡. This is because for SpanBERT the utilization of span-level information is fixed for every downstream task, whereas in our method an extra module designed to incorporate span-level information is trained during fine-tuning and can thus be more dynamically adapted to different downstream tasks.

Table 3: Time and computing resources required to incorporate span-level information.
Method | Time | Resource
Pre-train | 32 days | 32 Volta V100
Span Fine-tune | 12 hours max | 2 Titan RTX

‡ The baseline of SpanBERT is BERT-1seq, a BERT pre-trained without the next sentence prediction objective.

Table 5 indicates that Span Fine-tuning also enhances the results of PrLMs on the SNLI benchmark. The improvement achieved by Span Fine-tuning is similar to the published state of the art accomplished by SemBERT. However, compared to SemBERT, Span Fine-tuning saves much more time and computing resources: Span Fine-tuning merely leverages a pre-sampled dictionary to facilitate segmentation, whereas SemBERT leverages a pre-trained semantic role labeler, which brings extra complexity to the whole segmentation process.
Furthermore, Span Fine-tuning differs from SemBERT in terms of motivation, method and contributing factors. The motivation of SemBERT is to enhance PrLMs by incorporating explicit contextual semantics, whereas the motivation of our work is to let PrLMs leverage span-level information in fine-tuning. As for method, SemBERT concatenates the original representations given by BERT with representations of semantic role labels; in comparison, our work directly leverages a segmentation given by a pre-sampled dictionary to generate span-enhanced representations and requires no pre-trained semantic role labeler. The gain of SemBERT comes from semantic role labels, while the gain of our work comes from the specific segmentation, which is very different.
It is worth noting that a semantic role labeler can also generate segmentations. However, a semantic role labeler will produce multiple segmentations for a sentence that has several predicate-argument structures. Furthermore, such segmentations are sometimes too coarse-grained (with spans of more than ten words), which is impractical for our work.

Results with Stronger PrLMs
In addition to BERT, we also apply Span Fine-tuning to stronger PrLMs, such as RoBERTa (Liu et al., 2019b) and SpanBERT (Joshi et al., 2019), which optimize BERT by enhancing the pre-training procedure and by predicting text spans rather than single tokens, respectively. Table 4 shows that Span Fine-tuning can strengthen both RoBERTa and SpanBERT. Although RoBERTa is already a very strong baseline, we remarkably improve its performance on RTE by four percentage points. SpanBERT already incorporates span-level information during pre-training, but the results still support that Span Fine-tuning utilizes span-level information and improves the performance of PrLMs in a different dimension.

Ablation Study
In order to determine the key factors in Span Fine-tuning, a series of studies is conducted on the dev sets of eight NLU tasks, with BERT BASE chosen as the PrLM. As shown in Table 6, three sets of ablation studies are performed. For the experiment BERT BASE + CNN, only a hierarchical CNN structure is applied, to evaluate whether the improvement comes from the extra parameters. To illustrate, we first apply two layers of CNN over the token-level representations given by BERT. Then, a max-pooling operation is applied to get the sentence-level representation. Finally, the sentence-level representation and the [CLS] representation of BERT are concatenated and sent to the classifier. In this way, the parameters of BERT BASE + CNN are the same as in our method. For the experiment BERT BASE + CNN + Random SF, random sentence segmentation is applied, to test whether the proposed segmentation method of Span Fine-tuning really contributes to span-level information incorporation. For the experiment BERT BASE + CNN + NLTK SF, we conduct the experiments using a pre-trained chunker from the Natural Language Toolkit, to see whether the proposed segmentation method of Span Fine-tuning achieves further improvements over it. The results of the experiment BERT BASE + CNN suggest that the improvement is unlikely to come from the extra parameters, since this variant reduces the overall performance by 0.1 percent. The experiments BERT BASE + Random SF and BERT BASE + NLTK SF indicate that the segmentation generated by a pre-trained chunker, or even random segmentation, can also achieve enhancement under the Span Fine-tuning structure. However, a pre-trained chunker demands an additional part-of-speech parsing process, while our segmentation method only relies on a pre-sampled dictionary, saves much more time, and at the same time achieves greater improvement.
Our Span Fine-tuning remarkably enhances the results on all NLU tasks, raising the average score by 1.6 percentage points. Overall, the results of the experiments indicate that the performance improvement is primarily a result of our unique segmentation method.

§ Random SF represents Span Fine-tuning with randomly segmented sentences. ¶ NLTK SF represents Span Fine-tuning with segmentation generated by an NLTK pre-trained chunker.

Conneau et al. (2017) and Toshniwal et al. (2020) note that the influence of sentence encoder architectures on performance varies a lot from case to case. [Table 5 note: SemBERT (Zhang et al., 2019b) is the published SOTA on SNLI.]

Encoder Architecture
To evaluate the effectiveness of our encoder architecture, we replace the component of the encoding layer and the overall structure respectively. For the component of the encoding layer, CNN (Kim, 2014) and the Self-attentive module (Lin et al., 2017) are compared. For the overall structure, two structures are considered: a single layer structure with the max-pooling operation and a hierarchical structure.
By matching each component of the encoding layer with each overall structure, four different encoder architectures are generated: CNN-Maxpooling, CNN-CNN, Attention-Maxpooling, and Attention-Attention. Experiments are conducted on the SNLI dev and test sets. Table 7 suggests that the hierarchical CNN (CNN-CNN) is the most suitable encoder architecture for our purposes.

Size of n-gram Dictionary
Since our segmentation method is based on a pre-sampled dictionary, the size of the dictionary has a large impact on segmentation results. Figure 3 depicts how the average number of spans per sentence changes with dictionary size on the CoLA and MRPC datasets. At the origin, where no segmentation is applied, every token is considered a span. The number of spans drops significantly as the dictionary size grows and more n-grams are matched and grouped together. To evaluate the influence of dictionary size on PrLM performance, experiments are conducted on the dev sets of two NLU tasks: CoLA and MRPC. To concentrate on the impact of segmentation and reduce the influence of the sub-token-level representations provided by the PrLM, the concatenation process is not applied in this experiment; rather, the span-level information enhanced representations are directly sent to a dense layer to generate predictions. As demonstrated in Figure 4, the incorporation of a pre-sampled n-gram dictionary yields stronger performance than random segmentation. Moreover, dictionaries of medium size (20k to 200k) commonly result in better performance. This trend matches intuition, given that small dictionaries are likely to omit meaningful n-grams, whereas very large ones tend to over-combine meaningless n-grams.

Span Fine-tuning for Token-Level Tasks
The aforementioned experiments are conducted on the GLUE benchmark, whose tasks are all at the sentence level. Nevertheless, token-level representations are needed in many other NLU tasks, such as named-entity recognition (NER). Our approach can be applied to token-level tasks with a simple modification of the encoder architecture (e.g. removing the pooling layer of the CNN module). Table 8 shows the results of our approach on the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003).

Conclusion
This paper proposes Span Fine-tuning, which maximizes the advantages of flexible span-level information in fine-tuning together with the sub-token-level representations generated by PrLMs. Leveraging a reasonable segmentation provided by a pre-sampled n-gram dictionary, Span Fine-tuning can further enhance the performance of PrLMs on various downstream tasks. Compared with previous span pre-training methods, our Span Fine-tuning remains competitive for the following reasons:
Task-adaptive For methods that incorporate span-level information in pre-training, the utilization of span-level information cannot easily be adjusted for every downstream task, as the span pre-training is fixed after tremendous computational cost. In our method, the extra module designed to incorporate span-level information is trained during fine-tuning, resulting in more dynamic adaptation to different downstream tasks.
Flexible to PrLMs Our approach can be generally applied to various PrLMs including RoBERTa and SpanBERT.
Novelty Our approach can further improve the performance of PrLMs pre-trained with span-level information (e.g. SpanBERT). This result implies that our method utilizes span-level information in a different manner from PrLMs pre-trained with span-level information, which distinguishes our method from previous works.