Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability

This paper investigates whether the power of models pre-trained on text data, such as BERT, can be transferred to general token-sequence classification applications. To verify the pre-trained models' transferability, we test them on text classification tasks whose token meanings are deliberately mismatched, and on real-world non-text token-sequence classification data, including amino acid, DNA, and music data. We find that even on non-text data, the models pre-trained on text converge faster and perform better than randomly initialized models, and only slightly worse than models using task-specific knowledge. We also find that the representations of the text and non-text pre-trained models share non-trivial similarities.


Introduction
In recent NLP research, pre-trained masked language models (MLMs) such as BERT (Devlin et al., 2019) are widely used by practitioners. After being pre-trained on large corpora such as Wikipedia, the models can be fine-tuned quickly on NLP tasks like text classification and question answering, and generalize well on small datasets such as RTE in GLUE (Wang et al., 2018). To apply and improve BERT in more specialized domains such as scientific articles or clinical data, several MLMs have been proposed that pre-train on domain-specific text data (Lee et al., 2020; Beltagy et al., 2019). The concept of the MLM can also be extended to other, possibly non-linguistic, disciplines as long as the input is discrete. For example, Min et al. (2019) pre-train an MLM called PLUS on amino acid sequence data and achieve state-of-the-art performance on several protein classification tasks.
This paper examines whether a model pre-trained on large text corpora, such as BERT, can be efficiently adapted to data whose number of distinct tokens, token distribution, labels, and structure are very different from natural language (the target data could even be non-text). We refer to this ability as discipline adaptability¹. Previous work (Papadimitriou and Jurafsky, 2020) only shows that language models (LMs) pre-trained on non-text data can be adapted into LMs of human languages. This work is the first to examine whether pre-trained MLMs can learn the relation between labels and data never seen during pre-training. Our contributions are the following.
• We propose settings to examine the discipline adaptability of the pre-trained models. We find that BERT, BERT-Chinese, ALBERT, and RoBERTa can reduce training loss much more quickly, generalize better than the randomly initialized models on the non-text data, and are just slightly worse than the models using prior knowledge within each discipline.
• Our analyses indicate that before fine-tuning, the similarity between BERT and the MLMlike model pre-trained on the non-text data is much higher than the one between BERT and the randomly-initialized model, which helps to explain the success of BERT within the nontext disciplines. Furthermore, our extensive investigation of several hypotheses about attention similarity, hierarchical structure in the non-text data, and training stability indicates that these hypotheses are not sufficient to explain the discipline adaptability of BERT.
We believe the findings on discipline adaptation will prompt the NLP community to ponder what is actually learned during the pre-training procedure. The findings can also help disciplines for which the large-scale datasets essential to practitioners are not available.

¹ We use the term discipline adaptation instead of domain adaptation. In NLP, domain adaptation usually refers to settings like the transfer from general text to specialized text data such as scientific articles. We use the term discipline to emphasize data with very different distributions and structures.

Method
To examine the discipline adaptability of the models pre-trained on text corpora, we fine-tune the pre-trained models on two types of downstream data. The first type (section 2.1) consists of synthetic datasets, generated by permuting the tokens in common NLP datasets so that the meaning of each token is changed. The second type (section 2.2) covers a more challenging situation, in which the downstream tasks are not related to human language at all.

Synthetic data
We first test the models on synthetic text sequence classification data, generated as follows. Given a text sequence classification dataset, we define a deterministic one-to-one mapping T that changes a subword token x_i in a text sequence into another subword token T(x_i), as shown in figure 1a. To elaborate on the design of the synthetic text data, consider the tokens in the text corpora as nodes in a graph and the relationships among tokens as edges. The graph of the synthetic data is then isomorphic to the original graph. Hence, the structure of the synthetic data (the structure of the graph) is identical to the original one, and the tasks on the synthetic datasets are exactly as difficult as the original tasks (if the pre-training procedure is not considered). If the pre-trained model can still outperform the model trained from scratch on this artificial data, it indicates that pre-trained models can transfer knowledge to downstream tasks whose token meanings are completely different from those in the pre-training corpora. It is therefore probable that we can also take advantage of pre-trained models when processing real-world non-text data.
In our experiments, we first pre-train the model on normal text corpora, and then fine-tune and test it on the synthetic data. We choose T(i) = (i + 1000) mod D, where D is the vocabulary size of the model. We have also tried generating the mappings randomly; the results are similar and are left in the appendix. As a real example, the sentence "his healthy sense of satire is light and fun..." in the GLUE dataset is changed to "canadian franzme 1988pia leader watch sports czech at at at"².
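As a minimal sketch, the shift mapping above is a one-liner over token ids. The vocabulary size 30522 is that of BERT-base-uncased; the example ids below are illustrative, not the actual tokenization of the quoted sentence.

```python
D = 30522  # vocabulary size of BERT-base-uncased

def permute_tokens(token_ids, shift=1000, vocab_size=D):
    """Map every subword id to another id via the fixed bijection
    T(i) = (i + shift) mod vocab_size."""
    return [(i + shift) % vocab_size for i in token_ids]

original = [2010, 7965, 3168, 1997, 18312]  # illustrative subword ids
synthetic = permute_tokens(original)
# Because T is a bijection, the token co-occurrence graph of the synthetic
# corpus is isomorphic to that of the original corpus.
```

The modulo wrap-around keeps the mapping one-to-one over the whole vocabulary, so the synthetic tasks are exactly as hard as the originals.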

Real-world non-text data
To further validate the discipline adaptability of the pre-trained model, we fine-tune it on real-world non-text data. In these downstream tasks, both the token distributions and the number of distinct tokens can be very different from the text data used for pre-training, making this a more difficult setting for evaluating the transferability of pre-trained models.
To process non-text data with BERT, we map each token of the non-text data to one subword token, as in figure 1b. In the following experiments, the (deterministic) mapping table is generated randomly, because we find that different mappings lead to similar results as long as we do not map the non-text tokens to the unused tokens of the pre-trained models. In the fine-tuning phase, we add a randomly initialized linear classifier on top of the pre-trained model without re-initializing any pre-trained parameters, including the embedding layer, and then fine-tune the whole model.
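A minimal sketch of building such a mapping is shown below. The amino-acid alphabet is real, but the "unused id" range and the candidate-id range are illustrative placeholders, not the exact ids used in our experiments.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
VOCAB_SIZE = 30522                          # BERT-base-uncased vocabulary
UNUSED_IDS = set(range(1, 100))             # hypothetical "unused token" range

def build_token_mapping(alphabet, seed=0):
    """Randomly assign each non-text token a distinct subword id,
    avoiding the pre-trained model's unused tokens (and low special ids)."""
    rng = random.Random(seed)  # seeded, so the mapping is deterministic
    candidates = [i for i in range(1000, VOCAB_SIZE) if i not in UNUSED_IDS]
    ids = rng.sample(candidates, len(alphabet))
    return dict(zip(alphabet, ids))

mapping = build_token_mapping(AMINO_ACIDS)
sequence = "MKVL"                            # a toy amino-acid fragment
token_ids = [mapping[t] for t in sequence]
# A randomly initialized linear classifier is then placed on top of the
# pre-trained encoder, and the whole model (embeddings included) is fine-tuned.
```

Seeding the generator makes the table deterministic per run while still being "randomly generated", matching the setup described above.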

Setup
We use the GLUE dataset to generate the synthetic data. The validation sets are used to test the models; WNLI is excluded as in Devlin et al. (2019). For the real-world non-text data, we include the following tasks with different numbers of tokens, token distributions, and structures:
Protein classification (3 tasks): Localization, fluorescence, and stability. The input is amino acid sequences.
DNA classification: The input is DNA sub-sequences consisting of 4 different tokens.
Music composer classification (1 task): We use the MAESTRO-v1 dataset (Hawthorne et al., 2019). The input is pitch sequences consisting of 128 different tokens.

Figure 2: The average scores (y-axis) of the pre-trained models and of the models trained from scratch within each discipline. Higher scores are better. The black error bars denote the standard deviation across random seeds. The red lines mark the performance of the discipline-specific models. "-c": Chinese.
The pre-trained models used in the experiments include BERT-base-uncased, BERT-base-Chinese, ALBERT-base-v1, and RoBERTa-base. The randomly initialized (trained from scratch) models have the same architecture as BERT-base. The experiments on BERT-large are left in the appendix due to space limitations. The models are initialized with the default distributions widely adopted for pre-training (e.g., N(0, 4 × 10^-4) for BERT-base). For detailed hyperparameters, please refer to the appendix. For simplicity, we use "pre-trained models" to refer to the above models pre-trained on natural language if not specified otherwise.

Figure 2 shows the average scores of the pre-trained models (blue bars) and the models trained from scratch (orange bars) within each discipline. The means (bars) and standard deviations (black error bars) in figure 2a are calculated over three random seeds, and those in figures 2b, 2c, and 2d over six independent runs (with different token mappings). The GLUE score of BERT fine-tuned on normal GLUE acts as the discipline-specific top-line (red lines). For protein classification, the discipline-specific model is PLUS-TFM (Min et al., 2019), a 12-layer transformer MLM pre-trained on protein sequences; for DNA classification, it is Hilbert-CNN (Yin et al., 2018). For music composer classification, we classify over all composer classes in the dataset, whereas previous works (Kim et al., 2020; Spijker, 2020) use only a subset of the classes, so no discipline-specific model is available. Detailed scores of each task within each discipline are left in the appendix.

Results
The pre-trained models outperform the models trained from scratch in all disciplines. The phenomenon holds across pre-trained models with different architectures (ALBERT), pre-training objectives and amounts of pre-training data (RoBERTa), and different natural languages (BERT-Chinese). Furthermore, the pre-trained models perform only slightly worse than PLUS-TFM and Hilbert-CNN without using any discipline-specific knowledge. The standard deviations of most models and disciplines are small, which implies that the effect of different token mappings is marginal.
At first sight, fine-tuning the pre-trained models on synthetic GLUE seems equivalent to randomly initializing the word embedding layer and then fine-tuning the pre-trained models on normal GLUE, which we call re-embedding (re-emb). If the equivalence held, an explanation for the performance gain of the pre-trained models would simply be that the intermediate layers are already trained. Nevertheless, figure 2a shows that the equivalence does not hold: re-emb (green bar) degrades the performance. For the non-text data, the performance of re-emb is also worse than that of the models with all pre-trained parameters, as shown in figures 2b, 2c, and 2d. Accordingly, the pre-trained word embedding layer benefits the non-text downstream tasks even though the meanings of the tokens differ from those seen in pre-training. We also find that mapping to the unused tokens of the pre-trained models degrades performance all the way to the trained-from-scratch baseline.

Discussion
The results in section 3.2 validate the potential of the pre-trained models as strong cross-disciplinary knowledge learners. Their success could stem from better generalization ability or from better training-loss dynamics. In this section, we analyze the contribution of pre-training in terms of optimization and generalization. In addition, we try to explain the success of the pre-trained models by comparing the representations from BERT and PLUS on the protein data.

Figure 3 shows that BERT always reduces the training loss faster than the models trained from scratch. On the fluorescence task in figure 3b, the model trained from scratch quickly becomes stuck in a local minimum, while BERT escapes it as fine-tuning proceeds. On a small dataset like STS-B in figure 3a, BERT can reduce the training loss in only hundreds of steps, while the training loss of the model trained from scratch is still high. The results on the other tasks are similar and are left in the appendix.

Table 1: PWCCA similarity (a value in [−1, 1]) between the representations of the last layer of the models on protein data. None of the models are fine-tuned. "random" means randomly initialized models with the same architecture as BERT.

Generalization ability
We then further examine the generalization ability of the pre-trained models. We train all the models on only 1% of the non-text data, so that both the pre-trained models and the models trained from scratch can converge to almost zero training loss, and we compare their validation performance to gauge their generalization ability. Figure 4 shows the results on one DNA dataset and one protein dataset; the results on the other tasks are similar and are left in the appendix. Under the 1% training-data setting, the training losses of the pre-trained models and the models trained from scratch both converge to zero, yet the pre-trained models still surpass the trained-from-scratch ones on the validation sets. Therefore, pre-training improves generalization in discipline adaptation.

Representation similarity
To explain the success of the text pre-trained models on the non-text data, we apply Projection Weighted Canonical Correlation Analysis (PWCCA) (Morcos et al., 2018) to the representations of BERT and PLUS-TFM. The results in table 1 show that, before fine-tuning, the similarity between BERT and PLUS is much higher than the similarity between BERT and the randomly initialized model. Thus BERT behaves differently from randomly initialized models when processing non-text data, even though it is pre-trained only on natural language, and this could be one of the reasons behind BERT's discipline adaptability.
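For illustration, the sketch below computes mean CCA similarity, a simplified cousin of PWCCA (which additionally weights each canonical direction by how much of the first representation it explains). The representation matrices here are random stand-ins, not actual BERT or PLUS features.

```python
import numpy as np

def mean_cca_similarity(X, Y):
    """Mean canonical correlation between representations X (n x d1) and
    Y (n x d2). A simplified stand-in for PWCCA (Morcos et al., 2018)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)  # orthonormal basis of X's column space
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the canonical correlations, each in [0, 1].
    corrs = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(corrs.mean())

rng = np.random.default_rng(0)
H_a = rng.normal(size=(512, 64))           # stand-in for one model's features
H_b = H_a @ rng.normal(size=(64, 64))      # invertible linear transform of H_a
print(mean_cca_similarity(H_a, H_b))       # -> 1.0 (same subspace)
```

CCA is invariant to invertible linear transforms, so two representations spanning the same subspace score 1, while unrelated random representations score much lower.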

Hypotheses
To probe the reasons behind the discipline adaptability of the pre-trained MLMs, we have explored several possibilities; however, none of them is sufficient to explain the phenomenon. In the following sub-sections, we summarize these experiments. Some detailed results are left in the appendix.

Attention similarity
We examine the similarity of the attention maps among BERT, PLUS, and the randomly initialized model. For a given input, we first extract the attention maps in each layer. For the attention maps in the same layer, we use the Hungarian algorithm to find the minimum-L1-distance matching between the maps from the different models; the average distance under this matching represents the similarity of attention patterns in that layer. The results on the protein data are shown in figure 5. In almost all layers, the distance between BERT and PLUS is larger than that between BERT and the random model, whether fine-tuned or not. From this viewpoint, we cannot conclude that PLUS and BERT share common attention patterns.
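A sketch of the matching step is below, assuming the attention maps of one layer are stacked into arrays of shape (heads, seq, seq); the shapes are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def attention_map_distance(maps_a, maps_b):
    """Minimum-cost one-to-one matching between two sets of attention maps
    (shape: heads x seq x seq) under L1 distance. Returns the average
    per-pair L1 distance of the optimal matching."""
    n = maps_a.shape[0]
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            cost[i, j] = np.abs(maps_a[i] - maps_b[j]).sum()
    rows, cols = linear_sum_assignment(cost)  # optimal head assignment
    return float(cost[rows, cols].mean())
```

Matching heads before averaging prevents an arbitrary head ordering from inflating the distance: two models whose heads compute the same patterns in a different order still score a distance of zero.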

Properties of the pre-training data
We would like to investigate which properties of the pre-training data give rise to discipline adaptability. We used the following data, also used in (Chiang and Lee, 2020), to pre-train MLMs:
• uniform: Tokens in a sentence are sampled i.i.d. from a uniform distribution over all tokens.
• flat or nesting parentheses: Tokens in a sentence are generated randomly and recursively such that they are hierarchically matched.
• Kannada (Ortiz Suárez et al., 2020): Kannada is a language spoken by people in southwestern India.
However, as shown in table 2, the models pre-trained on the artificial data perform worse on protein classification than the one pre-trained on natural language. So natural language may indeed share similarities with protein sequences, while hierarchical structure alone may not be enough to explain the discipline adaptability.

Gradient stability
We also examine whether BERT satisfies the following criteria on training stability:
• Saxe et al. (2014) claim that if the singular values of the output-input Jacobian matrix at model initialization are all equal to 1 (called dynamical isometry), the model can avoid vanishing or exploding gradients and be trained better.
• Sankararaman et al. (2020) show that negatively correlated gradients produced by different data points slow down convergence.
• Liu et al. (2020) observe that a large variance of the transformer's output under parameter perturbation makes the training procedure unstable.
On the synthetic GLUE, BERT does not satisfy these criteria better than the Gaussian initialization, and sometimes satisfies them worse. Although BERT is optimized better even on the non-text data, the above theories fail to explain its optimization properties. The detailed results are left in the appendix.

Conclusion
In this paper, we investigate the potential of BERT as a cross-disciplinary knowledge learner. By fine-tuning BERT on synthetic text data with the meanings of tokens changed, and on non-text data, we verify that BERT can be adapted efficiently to data from different disciplines and generalizes well. Besides, using PWCCA we discover a non-trivial similarity, present before fine-tuning, between the models pre-trained on text and on protein sequences, which helps to explain BERT's discipline adaptability. We hope that the proposed settings can act as new analysis tools for researchers and provide new insight into the power of pre-trained models.

Broader impact
The results of this paper are helpful for practitioners in other disciplines where large-scale pre-training datasets are unavailable. The discipline adaptability of the pre-trained models also helps to reduce computational costs, since we may not need to pre-train one model per discipline. We think that the results in this paper will not cause any ethical issues.

A Hyperparameters for experiments
The transformer models used in the experiments are 12-layer, 768-hidden, 12-attention-head models if not specified. The total number of parameters is 110M, the same size as BERT-base. For BERT-large-uncased (bert-l) and the large model trained from scratch (scratch-l) in the appendix, the total number of parameters is 340M. PLUS-TFM has the same structure as BERT-base, with 110M parameters. The Hilbert-CNN model has 961K parameters according to the original paper.

We use the Adam optimizer for all experiments, with the learning rate set to 10^-5. The optimizer was chosen by grid search on MRPC from the GLUE dataset, and the learning rate by grid search on MRPC and the validation set of the fluorescence protein classification task. We search the learning rate from 10^-4 to 10^-7, uniformly sampling 5 points in this range and a further 5 points between 10^-5 and 10^-6. We search over the optimizers Adagrad, Adam, Adamax, RAdam, and NovoGrad with three independent runs each. We choose the parameters that make the randomly initialized 12-layer transformer models achieve the highest F1 score on the MRPC training set and the highest Spearman correlation on the fluorescence validation set (these are also the best for the re-emb setting). We do not use gradient clipping or warm-up, so the learning rate schedule is plain linear learning-rate decay.

All models are trained with batch size 32 on two RTX 2080 Ti GPUs (GLUE dataset) or one Tesla V100 GPU (protein, DNA, and music classification). For the GLUE dataset, we use the validation set of GLUE as the test set and evaluate the final checkpoint. For all the non-text datasets, we select the best checkpoint on the validation set and evaluate it on the test set.

B Full results on synthetic GLUE
The full results on the synthetic GLUE dataset are shown in table 3. The pre-trained models (including the large model) outperform the models trained from scratch except on SST-2 and CoLA: on SST-2, the pre-trained models generalize worse than the models trained from scratch, and on CoLA, all models fail to train. On the other six tasks, the pre-trained models outperform the models trained from scratch. The standard deviations of most of the models are smaller than 2, except on the RTE dataset and for the large models. For RTE, the maximum standard deviation is 5.24 (ALBERT). For the large models, the standard deviations are much larger and are listed in table 4. When we generate the token mappings randomly, the results are similar, which indicates that the effect of different token mappings is marginal.
For BERT with the word embedding layer randomly initialized and then fine-tuned (re-emb), the performance is worse than that of the model using all pre-trained weights, which indicates that even the pre-trained word embeddings are necessary.

C Testing and validation performance on non-text data

Table 5 and 7 show the full testing and validation results on each non-text classification task. Table 6 and 8 show the average scores for each discipline. For most of the tasks, the text pre-trained models outperform the models trained from scratch and the re-emb models on both the testing and validation sets.

Table 3: Full results on the GLUE validation set (averaged over three random seeds). The evaluation metrics are listed below the task names. Normal data means the models are fine-tuned on normal GLUE; permutation means the models are fine-tuned on synthetic GLUE; random mapping means the token mapping is generated randomly. "avg": the average score (GLUE score). "m/mm": MNLI matched/mismatched set. "spr": Spearman correlation. "mcc": Matthews correlation coefficient. "re-emb": randomly initializing the word embedding layer of BERT and fine-tuning BERT.

D Training loss for other tasks

Figures 6 and 7 show the training losses on the GLUE tasks and the other non-text datasets, respectively. BERT reduces the training loss more quickly than the models trained from scratch except on the SST-2 task, on which BERT performs worse. The results are consistent across disciplines. Figures 8 and 9 show the results of BERT and the model trained from scratch with only 1% of the training data of the GLUE dataset and the non-text datasets, respectively. We do not conduct this experiment on the splice and MAESTRO-v1 datasets due to the limited size of their training sets. For most of the tasks, BERT generalizes better than the models trained from scratch.

E Generalization experiments for other tasks
F Detailed results of section 4.4

F.1 Dynamical isometry

Figure 10 shows the distribution of the singular values of the output-input Jacobian matrices of BERT-base, BERT-large, and ALBERT-base. The Jacobian matrices are computed by differentiating the representation from the last layer with respect to the input word embeddings, with input data from the normal GLUE dataset. Compared to the random initialization (scratch in figure 10), the singular values of BERT and ALBERT concentrate at zero rather than at one, which is the opposite of the dynamical-isometry hypothesis.

Table 5: Testing results of protein classification and DNA classification. The metric is Spearman correlation for fluorescence and stability, and accuracy for all the other tasks. The number in parentheses is the standard deviation (calculated over six independent runs with different token mappings). "specific": the discipline-specific models.

Table 6: Testing results of music composer classification, the average score of DNA classification, and the average score of protein classification. The numbers in parentheses are the standard deviations calculated over six independent runs with different token mappings.
model      music        DNA (avg)    protein (avg)
bert-l     43.6 (14.7)  70.4 (6.7)   30.8 (4.0)
scratch-l  43.2 (2.6)   74.5 (1.8)   26.0 (5.0)
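As an illustration of the dynamical-isometry criterion, the sketch below checks the Jacobian singular values of a toy deep linear network (not BERT itself; the depth, width, and initializations are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 8, 64

def jacobian_singular_values(weights):
    """For a deep *linear* network, the output-input Jacobian is simply the
    product of its weight matrices; return its singular values."""
    J = np.eye(width)
    for W in weights:
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

# Orthogonal initialization achieves exact dynamical isometry:
# the Jacobian is orthogonal, so every singular value is 1.
orth = [np.linalg.qr(rng.normal(size=(width, width)))[0] for _ in range(depth)]
print(jacobian_singular_values(orth).round(6))  # all 1.0

# A small-variance Gaussian initialization instead drives the singular
# values toward zero as depth grows.
gauss = [rng.normal(scale=0.02, size=(width, width)) for _ in range(depth)]
print(jacobian_singular_values(gauss).max())    # far below 1
```

For BERT the Jacobian must be obtained by automatic differentiation through the nonlinear layers, but the diagnostic, i.e. where the singular-value distribution concentrates, is the same.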
Therefore, it is hard to claim that the power of BERT and ALBERT originates from dynamical isometry.

F.2 Gradient confusion

Figure 11 shows the cosine similarity of gradients produced by different data points in the synthetic GLUE dataset. Although the cosine similarity of BERT is larger than that of its randomly initialized (scratch) counterpart, ALBERT shows the opposite trend: the cosine similarity of pre-trained ALBERT is smaller than that of its scratch counterpart. Yet pre-trained ALBERT still outperforms the random initialization, which indicates that avoiding gradient confusion may not be the key to the pre-trained MLMs' discipline adaptability.
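The measurement itself is simple; a toy sketch with a linear model and squared loss is below (the model and data are illustrative, not the transformer gradients of the actual experiment).

```python
import numpy as np

def grad_cosine(grad_a, grad_b):
    """Cosine similarity between the loss gradients produced by two data
    points; negative values indicate gradient confusion
    (Sankararaman et al., 2020)."""
    a, b = grad_a.ravel(), grad_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy illustration with a linear model f(x) = w^T x and squared loss:
# grad_w (w^T x - y)^2 = 2 (w^T x - y) x.
rng = np.random.default_rng(0)
w = rng.normal(size=16)

def grad(x, y):
    return 2.0 * (w @ x - y) * x

x1, x2 = rng.normal(size=16), rng.normal(size=16)
print(grad_cosine(grad(x1, 0.0), grad(x2, 0.0)))
```

In the actual experiment, the per-example gradients are those of the full model's parameters; the cosine is averaged over many pairs of data points.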

F.3 Output variance under perturbation
We inject zero-mean Gaussian noise into the model parameters and measure the variation of the model's outputs under the noise, represented by the L2 distance between the model's outputs before and after adding the noise. We choose the standard deviation of the noise to be 10^-2, 10^-4, 10^-6, and 10^-8. Figures 12 and 13 show the results of BERT-base and ALBERT-base on the three synthetic GLUE tasks, respectively. We find that BERT and ALBERT show contrary trends: the variation of BERT is smaller than that of its randomly initialized counterpart, while that of ALBERT is larger. So this hypothesis is also not sufficient to explain the discipline adaptability of the pre-trained models.
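A sketch of this measurement on a toy two-layer network (a stand-in for the transformer; the sizes and noise scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(32, 16)), rng.normal(size=(8, 32))

def forward(x, W1, W2):
    return W2 @ np.tanh(W1 @ x)

def output_variation(x, sigma, trials=100):
    """Average L2 distance between the clean output and the output after
    injecting zero-mean Gaussian noise (std sigma) into the parameters."""
    base = forward(x, W1, W2)
    dists = []
    for _ in range(trials):
        N1 = rng.normal(scale=sigma, size=W1.shape)
        N2 = rng.normal(scale=sigma, size=W2.shape)
        dists.append(np.linalg.norm(forward(x, W1 + N1, W2 + N2) - base))
    return float(np.mean(dists))

x = rng.normal(size=16)
for sigma in (1e-2, 1e-4, 1e-6, 1e-8):
    print(sigma, output_variation(x, sigma))
```

The variation shrinks roughly linearly with the noise scale for small sigma; the quantity of interest is how the pre-trained and randomly initialized parameters compare at a fixed sigma.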

G Statistics of datasets

G.1 GLUE
GLUE is an English dataset that consists of several tasks. Table 9 shows the statistics of GLUE. We use the validation set as the test set in our experiments. The train/validation split can be found in the downloaded data.

Table 8: Validation results of music composer classification, the average score of DNA classification, and the average score of protein classification. The numbers in parentheses are the standard deviations calculated over six independent runs with different token mappings.

Table 11 shows the statistics of the DNA classification datasets. For the train/validation/test splits, we use a randomly chosen 90% of the samples as training data, 5% as validation data, and 5% as testing data, as Hilbert-CNN does. We do not apply any additional pre-processing for these datasets.

Figure 13: The mean (bar) and std (error bar) of the L2 distance between ALBERT's outputs with and without adding noise to the model parameters. "scratch" stands for the randomly initialized parameters.