How effective is BERT without word ordering? Implications for language understanding and data privacy

Ordered word sequences contain the rich structures that define language. However, it’s often not clear if or how modern pretrained language models utilize these structures. We show that the token representations and self-attention activations within BERT are surprisingly resilient to shuffling the order of input tokens, and that for several GLUE language understanding tasks, shuffling only minimally degrades performance, e.g., by 4% for QNLI. While bleak from the perspective of language understanding, our results have positive implications for cases where copyright or ethics necessitates the consideration of bag-of-words data (vs. full documents). We simulate such a scenario for three sensitive classification tasks, demonstrating minimal performance degradation vs. releasing full language sequences.


Introduction
Masked language models (MLMs) like BERT (Devlin et al., 2019) use an ordered sequence of tokens as input. And rightfully so! Any model capable of "language understanding" surely needs access to the hierarchical, syntactic structures implicitly encoded in language. But are MLMs really doing better because they have access to full word sequences?
To assess this question, we first compare the internal representations of BERT and RoBERTa (Liu et al., 2019) when the sequence of unigrams is not available (we use the "base" models supplied by the authors). We do this by using the bag-of-words counts of an input to generate a random ordering of the unigrams, i.e., "shuffling" the input. For example, in a sentiment classification corpus, if an intact input was "The movie was great!", a possible shuffled ordering might be "movie the great was" (tokenization details are in §4). We find that, though BERT appears to become more sensitive to ordering in later layers, shuffled token representations and self-attention activations still closely resemble their unshuffled counterparts.
Following cues from prior work (Sugawara et al., 2020; Si et al., 2019; K et al., 2020), we next report the performance of pre-trained MLMs fine-tuned on GLUE, a suite of English-language understanding benchmarks, when given access only to unigram count information by handing models randomly ordered sequences of words (an approach we call BoW-BERT, for short). For most GLUE tasks, performance degradation when shuffling is minimal, e.g., MNLI, QQP, and QNLI accuracy degrade by less than 5 accuracy points.
The bad news: Despite BERT being trained on intact word sequences, BoW-BERT demonstrates that MLMs can readily ignore syntax (while maintaining strong performance) when fine-tuned for even carefully designed downstream language understanding tasks. We thus advocate for reporting BoW-BERT's performance as a strong baseline.
The good news: BoW-BERT offers a practical modeling choice for researchers who must operate with only bag-of-words representations for legal or ethical reasons. Bag-of-words data releases are sometimes the only legal format in which copyright-sensitive corpora may be distributed, e.g., HathiTrust (16M historical volumes) (Christenson, 2011), Google N-grams (Michel et al., 2011), etc. And while ethical considerations sometimes preclude the full release of privacy-sensitive documents (e.g., medical transcriptions), bag-of-words data release offers the potential for compromise. While releasing unigram counts is one way of anonymizing documents (Gallé and Tealdi, 2015), recent work in differential privacy (Dwork, 2008; Fernandes et al., 2019; Schein et al., 2019) has resulted in randomized algorithms capable of privatizing BoW count data (under varying definitions of privacy). We explore classification tasks on three sensitive corpora, simulating different levels of input fidelity: full sequences, BoW counts, and differentially private (DP) BoW counts. We find that BoW-BERT often significantly outperforms prior BoW models, especially for shorter documents. And, for longer documents, BoW-BERT can even outperform full-sequence BERT. Finally, for the (naive) DP configuration we consider, BoW-BERT is a viable option for classifying shorter privatized documents, though linear BoW models remain competitive for longer documents.

Related Work
Shuffling inputs to non-pretrained models. Word order shuffling has been tested as part of the full training process for non-pretrained models. Sankar et al. (2019) shuffle words in a dialog corpus, and find that LSTMs are more sensitive than Transformers to word order. Khandelwal et al. (2018) show that shuffling distant context words (e.g., beyond 50 tokens) has little effect on LSTM language model predictions. Adi et al. (2017) show that LSTM autoencoders encode significant ordering information when fit to a corpus of Wikipedia sentences. Nie et al. (2019) report minimal performance decreases from word shuffling while training a number of model architectures, e.g., ESIM (Chen et al., 2017), for SNLI/MNLI tasks. In a multimodal setting, Cirik et al. (2018) show that shuffling doesn't affect performance for an LSTM in a referring expression task.
Shuffling inputs to pretrained MLMs. While, at the time of submission of this work, shuffling results had not been fully reported on the popular GLUE task set, prior work has used word shuffling as a baseline, with varying results. Sugawara et al. (2020) operationalize ablations of reading comprehension skills from Kintsch (1988), and report that shuffling n-grams in 10 QA corpora results in 10-20% performance decreases for BERT. Si et al. (2019) report similar results when shuffling questions+answers in MCRC corpora, reporting absolute accuracy drops of 5-20% when shuffling both passage/question words (e.g., BERT on DREAM drops from 63 → 41 accuracy relative to a 33% constant baseline). K et al. (2020) report that swapping tokens during pretraining of a multilingual BERT model results in moderate performance degradation for XNLI (e.g., 71 → 63 for en-es) but more significant performance degradation for NER (63 → 40 in the same setting). They find that a purely frequency-based corpus "is not enough for a reasonable cross-lingual performance." Several works have examined shuffling inputs in multi-language scenarios (e.g., translation) when languages have variable syntax (Ahmad et al., 2019; Liu et al., 2020). Zhao et al. (2020) use a random token permutation to provide a baseline. Yang et al. (2019) find that self-attention networks are surprisingly bad at identifying two tokens that are swapped in the input. Ettinger (2020) shows that shuffling BERT inputs decreases word cloze prediction performance on a corpus of 102 sentences without fine-tuning. Other work incorporates a deshuffling objective into pre-training.
In some cases, shuffled inputs provide a stronger baseline than might be assumed, while in others, shuffling significantly degrades performance. At present, determining whether or not word order is "needed" for a particular task is largely an empirical endeavor.

Syntax in MLMs. Prior work has investigated BERT's capacity to represent syntax: some researchers have designed prediction tasks that require syntactic knowledge (Linzen et al., 2016; Jawahar et al., 2019; Lin et al., 2019; Goldberg, 2019), while others have probed representations for linguistic information directly (Mareček and Rosa, 2018; Hewitt and Manning, 2019; Reif et al., 2019). Tenney et al. (2019) find that contextual representations outperform lexical representations on many syntactic tasks, but not on a suite of semantic prediction tasks. Htut et al. (2019) and Clark et al. (2019) find that some attention heads encode information useful for dependency parsing. Glavaš and Vulić (2020) show that intermediate supervised training of a biaffine parser has little effect on downstream MLM performance.
A Bouquet of Contemporaneous Work. While this work was in submission, several related works were posted to arXiv. Gupta et al. (2021) examine NLI, paraphrase detection, and sentiment classification, and show that destructive interventions do not significantly affect either model predictions or model confidence. Sinha et al. (2020) find a similar result for NLI tasks, and, in follow-up work, Sinha et al. (2021) demonstrate pretraining is possible on unordered sequences. Pham et al. (2020) look specifically at GLUE classification for BERT-based models. Beyond contemporaneous confirmation of the GLUE results, our work contributes to this bouquet by: 1) examining internal activations/layers and 2) exploring classification settings where one might need to operate on (potentially differentially private) count-only data.

Representation analysis
We might expect that shuffling the order of tokens in an input sentence would significantly corrupt the internal representations of BERT, but is that actually the case? We investigate with two new metrics. Consider applying a pre-trained, fixed BERT model to x = "the movie was great" and the shuffled x′ = "movie the great was".
Token identifiability measures the similarity of BERT's vector representations of a word token (e.g., "movie") in x and x′. Identifiability is high if the model has similar representations for tokens after their order is shuffled.
Self-attention distance measures whether BERT attends to similar tokens for each token in x and x′, regardless of their order (e.g., is "the movie was great" ≈ "movie the great was" to BERT?). Self-attention distance is low if the model attends to the same tokens after input shuffling.
Token Identifiability. Let MLM_l(x) be an R^{t×d} matrix, where t is the number of tokens in sentence x, d is the MLM's hidden dimension, and l is the layer index. In this setting, row i of MLM_l(x) is the MLM's representation of the ith token in sentence x. We compare MLM_l(x) to E[MLM_l(X′)], where X′ is drawn uniformly from perm(x), the permutations of x. For a specific sample x′ ∼ perm(x), we first take the row-wise cosine similarity of MLM_l(x) and MLM_l(x′), and treat the resulting t × t matrix as an instance of a bipartite linear assignment problem. The assignment accuracy (AA) score for (x, x′) is the proportion of assigned token pairs that have the same underlying word type. To avoid biasing towards shorter sentences, we take the ratio of the accuracy relative to chance, i.e.,

$$\text{ID-MLM}_l(x) = \mathbb{E}_{x' \sim \text{perm}(x)}\left[\frac{\text{AA}(\text{MLM}_l(x), \text{MLM}_l(x'))}{\text{AA}(\text{MLM}_l(x), \text{RAND})}\right] \quad (1)$$

where RAND is a random matrix of reals in R^{t×d} (in practice, we simply compute the assignment step of AA using an R^{t×t} matrix drawn from U[0, 1)).

Self-Attention Distance. Let AMLM_{l,h}(x) be the row-wise ℓ1-normalized R^{t×t} matrix representing the self-attention matrix at layer l for attention head h. We can compute the same matrix for a shuffled input, AMLM_{l,h}(x′), and then re-order its rows and columns to match the original order of tokens in x, yielding AMLM^x_{l,h}(x′). We define DS-JSD(AMLM_{l,h}(x), AMLM^x_{l,h}(x′)) as the mean row-wise Jensen-Shannon divergence (JSD) between AMLM_{l,h}(x) and the DeShuffled, re-ordered attention matrix AMLM^x_{l,h}(x′). As before, to reduce the effect of sentence length, we normalize by RND-JSD(AMLM_{l,h}(x), AMLM^x_{l,h}(x′)), which instead uses a random row/column permutation. (If repeated words make the de-shuffling mapping from x′ back to x ambiguous, DS-JSD chooses the valid re-ordering that minimizes the JSD, and RND-JSD searches over a number of random orderings equal to the number of valid re-orderings; if there are more than 16 valid re-orderings, 16 are sampled at random.) The final attention distance metric is

$$\text{AD-MLM}_{l,h}(x) = \mathbb{E}_{x' \sim \text{perm}(x)}\left[\frac{\text{DS-JSD}(\text{AMLM}_{l,h}(x), \text{AMLM}^x_{l,h}(x'))}{\text{RND-JSD}(\text{AMLM}_{l,h}(x), \text{AMLM}^x_{l,h}(x'))}\right] \quad (2)$$

Results. We randomly sample 100 sentences from the training set of each of 8 GLUE tasks, for a total of 800 sentences. To approximate the expectations in Equations 1 and 2, we sample 32 random permutations per sentence. Figure 1 gives the per-layer token identifiability and attention distance scores for both MLMs. For both metrics, later layers are more sensitive to order, i.e., ID-MLM ↓ and AD-MLM ↑. Attention heads vary significantly in their order sensitivity: each attention head is a single point in the scatterplot of Figure 1b. But, even at late layers, both metrics indicate significantly more than random correspondence: the internal representations of BoW-(Ro)BERT(a) clearly resemble their unshuffled counterparts.
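The core computations behind both metrics are compact. The sketch below is a simplified numpy/scipy re-implementation, not our exact experimental code: it assumes per-token hidden states and ℓ1-normalized attention rows have already been extracted from the model, and the function names are chosen here only for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assignment_accuracy(sim, types_orig, types_shuf):
    """Solve the bipartite assignment induced by a t x t similarity matrix and
    return the fraction of matched token pairs sharing the same word type."""
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    return float(np.mean([types_orig[i] == types_shuf[j] for i, j in zip(rows, cols)]))


def token_identifiability(reps_orig, reps_shuf, types_orig, types_shuf, seed=0):
    """Single-permutation term of Eq. (1): assignment accuracy under cosine
    similarity, normalized by the accuracy from a random similarity matrix
    (chance). In the paper this is averaged over 32 sampled permutations."""
    a = reps_orig / np.linalg.norm(reps_orig, axis=1, keepdims=True)
    b = reps_shuf / np.linalg.norm(reps_shuf, axis=1, keepdims=True)
    acc = assignment_accuracy(a @ b.T, types_orig, types_shuf)
    rand_sim = np.random.default_rng(seed).uniform(size=(len(types_orig), len(types_shuf)))
    chance = assignment_accuracy(rand_sim, types_orig, types_shuf)
    return acc / max(chance, 1e-12)


def mean_rowwise_jsd(attn_orig, attn_deshuffled):
    """Mean Jensen-Shannon divergence between corresponding (l1-normalized)
    attention rows; the same routine underlies DS-JSD and RND-JSD in Eq. (2)."""
    def kl(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    def jsd(p, q):
        m = 0.5 * (p + q)
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    return float(np.mean([jsd(p, q) for p, q in zip(attn_orig, attn_deshuffled)]))
```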

BoW-BERT for Classification
We compare BERT and RoBERTa to their BoW counterparts on nine tasks from GLUE. We run single-task training for six epochs, use early stopping, and optimize batch size ({16, 32}) and learning rate ({5, 2, 1, 0.5} × 10^{-5}) via grid search on the validation set. To shuffle documents, we lowercase, tokenize, remove all tokens that consist only of punctuation, shuffle, then concatenate with whitespace. We re-shuffle the training tokens each epoch, but fix validation and test tokens to one shuffled permutation.
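For concreteness, a minimal sketch of this shuffling step is below; the simple regex tokenizer and the shuffle_document name are illustrative stand-ins rather than our exact pre-processing code.

```python
import random
import re
import string


def shuffle_document(text, rng=random):
    """Lowercase, tokenize, drop punctuation-only tokens, shuffle, and re-join
    with whitespace. The regex tokenizer is a simplifying stand-in for the
    tokenization applied upstream of the MLM's subword tokenizer."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = [t for t in tokens if not all(c in string.punctuation for c in t)]
    rng.shuffle(tokens)
    return " ".join(tokens)


# Training inputs are re-shuffled each epoch; validation/test inputs are
# shuffled once and the permutation is then held fixed.
rng = random.Random(0)
print(shuffle_document("The movie was great!", rng))  # e.g., "movie the great was"
```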
Results. Table 1 gives the GLUE test set results of our algorithms vs. GloVe CBOW, the best BoW baseline on the GLUE leaderboard at the time of submission. In all cases BoW-BERT outperforms CBOW. The extent to which BoW-BERT underperforms relative to BERT varies by dataset: in terms of relative percent performance decrease, it ranges from over ↓70% for CoLA to only ↓3% for QQP. Outside of CoLA, performance degradation never exceeds 10 absolute points for any task's metric.

Classification for Sensitive Texts
Privacy and legal concerns frequently necessitate BoW-only data releases. We ask: for potentially sensitive text classification tasks, how does performance degrade if only bag-of-words counts are available (instead of full sequences)? We consider three such tasks: Reddit controversy prediction on AskWomen/AskMen (CONT) (Hessel and Lee, 2019), offensiveness prediction in social media (SBF) (Sap et al., 2020), and sample medical transcript categorization (MTSAMP). For each task, we compare models with access to sequences vs. models that can only access bag-of-words features. Our baselines are unigram/tf-idf linear models and the CBOW models GloVe and fastText (Mikolov et al., 2018). Table 2 contains corpus statistics and prediction results. For CONT and SBF, BoW-BERT outperforms all BoW methods. For all tasks, the performance drop-off from a full-sequence fine-tuned MLM to its BoW counterpart is less than 1%. CBOW/tf-idf remain strong for MTSAMP, in which documents are longer.
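The unigram/tf-idf linear baselines follow standard practice; a minimal scikit-learn sketch is below (the hyperparameters shown are illustrative defaults, not the tuned values behind Table 2).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Bag-of-words baseline: the model only ever sees unigram weights, so it is
# unaffected by word order by construction.
tfidf_clf = make_pipeline(
    TfidfVectorizer(lowercase=True, min_df=2),
    LogisticRegression(max_iter=1000),
)

# train_texts / train_labels / test_texts are placeholders for a task's splits:
# tfidf_clf.fit(train_texts, train_labels)
# preds = tfidf_clf.predict(test_texts)
```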
Given that de-shuffling BoW representations is at least partially possible (Tao et al., 2021), we additionally consider a more robust, differentially private (DP) unigram count data release (also known as the "local model" of DP) (Warner, 1965; Dwork et al., 2006; Schein et al., 2019). We follow a process similar to prior work: we first compress the original unigram count matrices via Gaussian random projection to 500 dimensions. In the compressed space, we add per-entry noise with the Laplace mechanism (Dwork et al., 2006) under a per-feature privacy budget of ε. Then, we invert the random projection, normalize the resulting vector into a categorical word distribution, and sample (unordered) pseudo-documents from that distribution with length ∼ Poisson(λ). (A note on the compression step: our original submission used DP PCA instead, but it was brought to our attention that the paper proposing that algorithm was retracted for being non-private, and the implementation was discontinued in the library we used after we submitted. We have adjusted our code and re-run our experiments using a comparable mechanism. Our intent isn't to advocate for this particular DP method, but rather to fairly compare NLP algorithms on the same DP corpora.)
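A simplified sketch of this count-privatization pipeline (projection → per-entry Laplace noise → inversion → renormalization → Poisson-length pseudo-document sampling) is below; the privatize_counts/vocab names, the default parameters, and the simplified sensitivity handling are illustrative assumptions rather than our exact implementation.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection


def privatize_counts(count_matrix, vocab, epsilon=100.0, doc_len=256, dim=500, seed=0):
    """Release (unordered) pseudo-documents from privatized unigram counts."""
    rng = np.random.default_rng(seed)

    # 1) Compress the n_docs x |V| count matrix to n_docs x dim.
    proj = GaussianRandomProjection(n_components=dim, random_state=seed)
    compressed = proj.fit_transform(count_matrix)

    # 2) Laplace mechanism: per-entry noise with scale 1/epsilon
    #    (sensitivity handling simplified for illustration).
    noisy = compressed + rng.laplace(scale=1.0 / epsilon, size=compressed.shape)

    # 3) Invert the projection back to vocabulary space via a pseudo-inverse.
    back = noisy @ np.linalg.pinv(proj.components_).T

    pseudo_docs = []
    for row in back:
        # 4) Clip negatives and renormalize to a categorical word distribution.
        p = np.clip(row, 0.0, None)
        p = p / p.sum() if p.sum() > 0 else np.ones_like(p) / len(p)
        # 5) Sample an unordered pseudo-document of Poisson-distributed length.
        length = rng.poisson(doc_len)
        words = rng.choice(vocab, size=length, p=p)
        pseudo_docs.append(" ".join(words))
    return pseudo_docs
```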
We report results for an easier setting (λ = 256, ε = 100) and a harder setting (λ = 128, ε = 50) in the bottom half of Table 2. For these DP settings, the linear baselines generally outperform BoW-(Ro)BERT(a). However, MLMs are again most competitive for the shortest-document setting, SBF, where BoW-(Ro)BERT(a) exceeds the best linear model performance (62.0 vs. 60.4 F1).
Taken together, these results suggest that: 1) releasing word counts instead of full document sequences is a viable data release strategy for some sensitive classification tasks; 2) BoW-BERT offers a means of accessing the representational power of modern MLMs in cases where only BoW information is available; and 3) for at least some local DP settings, linear models remain competitive, particularly for long documents, while BoW-RoBERTa is viable when the underlying documents are shorter.

Conclusion and Future Work
We advocate for BoW-(Ro)BERT(a) as a surprisingly strong baseline for language understanding tasks, as well as a performant practical option for classifying (privatized) BoW texts when documents are short. Future work includes: 1) evaluating BoW-BERT representations on BoW-only corpora in unsupervised text clustering scenarios (vs. classification), and designing self-supervised objectives for fine-tuning MLM weights on unlabelled, domain-specific BoW corpora, e.g., HathiTrust; 2) extending K et al. (2020) by further exploring BoW classification using non-English MLMs, where model dependence on syntactic information may differ; and 3) designing local private data release methods better adapted to MLM fine-tuning.