Does BERT Learn as Humans Perceive? Understanding Linguistic Styles through Lexica

People convey their intentions and attitudes through the linguistic styles of the text that they write. In this study, we investigate lexicon usage across styles through two lenses, human perception and machine word importance, since words differ in the strength of the stylistic cues that they provide. To collect labels of human perception, we curate a new dataset, Hummingbird, on top of benchmark style datasets. We have crowd workers highlight the representative words that make them think a text has the following styles: politeness, sentiment, offensiveness, and five emotion types. We then compare these human word labels with word importance derived from a popular fine-tuned style classifier, BERT. Our results show that BERT often assigns importance to content words that are not relevant to the target style and that humans do not perceive as stylistic, even though for some styles (e.g., positive sentiment and joy) human- and machine-identified words share significant overlap.


Introduction
To express their interpersonal goals and attitudes, people often use different styles in their communication. The style of a text can be as important as its literal meaning for effective communication (Hovy, 1987). NLP researchers have built many models to identify different styles in text, including politeness (Danescu-Niculescu-Mizil et al., 2013), emotion (Alm et al., 2005; Mohammad et al., 2018), and sentiment (Socher et al., 2013). Recently, transformer-based (Vaswani et al., 2017) pretrained language models, such as BERT (Devlin et al., 2019), have achieved impressive performance on many NLP tasks, including stylistic studies. However, explaining what these deep learning models learn remains a challenge. Thus, there is a growing effort to understand how these models behave (Rogers et al., 2021; Rajagopal et al., 2021). In this work, we attempt to understand style variation through the contrasting words identified by humans and BERT as determining a style. Given the subjective nature of styles, we are interested in capturing humans' inherent perception of stylistic cues in text and comparing this with BERT's "perception". Specifically, we investigate the extent to which BERT's word importance, as estimated using Shapley value-based attribution scores (Mudrakarta et al., 2018), aligns with human perception in stylistic text classification.

* Research conducted at the University of Pennsylvania. 1 Our dataset and code are available at https://github.com/sweetpeach/hummingbird

[Figure 1: Both humans and BERT label sentence (a), "I will understand if you decline, but would very much like you to accept. May I nominate you?", as "polite", whereas humans label sentence (b) as "anger" but BERT does not. Pink highlight: high human perception score. Blue: BERT's important words. Purple: the word is seen as a strong cue by both humans and BERT. Darker colors indicate higher human perception or machine word importance scores. Best seen in color.]
When humans identify styles in a text, specific words play an important role in recognizing the style, such as hedges for identifying politeness (Danescu-Niculescu-Mizil et al., 2013). We call such words stylistic cues. For example, in Figure 1(a), humans perceive the words "understand," "like," and "accept" as strong stylistic cues for politeness. But does the BERT model learn the same words as indicative? It turns out that although the model learns that the word "accept" is an important feature for classifying the text as polite, it disagrees with humans on "understand" and "like", identifying these words as signals for impoliteness. This raises the concern that lexical explanations from BERT could be unreliable and motivates us to look more deeply into the lexical cues used by humans and BERT. Since styles overlap significantly (Kang and Hovy, 2021), we cover multiple styles: politeness, sentiment, offensiveness, anger, disgust, fear, joy, and sadness.
Our contributions are as follows:
• This is the first comparative study to examine stylistic lexical cues from human perception and BERT. To characterize their discrepancy, we developed a dataset, called HUMMINGBIRD, in which crowd workers relabeled benchmark datasets for style classification tasks.
• We found that human and BERT cues are quite different: BERT pays more attention to content words, and word-level human labels provide more accurate multi-style correlations than sentence-level machine predictions.
• Our work differs from previous work that generated stylistic lexica from manually-curated seed words or thesauri (Davidson et al., 2017; Mohammad and Turney, 2010); instead, in our work, the full text is given to annotators, providing more context for the selection of cue words.

Collection of Human and BERT's Importance Scores on Stylistic Words
While there are many datasets with stylistic labels, to the best of our knowledge, there is no available dataset of stylistic texts with human labels on the individual words that drive human perception. Therefore, on top of existing benchmark style datasets, we develop HUMMINGBIRD, a new dataset with human-identified stylistic words in those stylistic sentences.

Human Perception Scores To collect human perception scores, we first pick 500 stylistically-diverse texts from the four style datasets by the following method. First, we fine-tune BERT on the training sets of the existing datasets using the original train/dev/test splits. The models' performance is shown in Table 1. We then run each model on every development set; for example, we run a sentiment classifier on our emotion dataset. From this, we obtain the probability score from each model for predicting each style.
To ensure that the chosen texts exhibit diverse styles, we sort them based on their probability scores and compute the standard deviation of these scores across the eight styles, following Kang and Hovy (2021). We then select the 50 most polite texts, the 50 most impolite texts, 50 positive texts, 50 negative texts, 100 offensive texts, and 200 emotional texts (40 from each emotion style), resulting in a total of 500 texts from the four different style datasets.
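A minimal sketch of this diversity-based selection step follows; the function and variable names are ours, and the per-style top-k selection described above is simplified to a single diversity ranking:

```python
import numpy as np

def rank_by_style_diversity(texts, style_probs):
    """Rank texts by stylistic diversity: texts whose predicted style
    probabilities vary more across the eight styles are treated as
    more stylistically marked.

    style_probs: (num_texts, 8) array; entry [i, k] is the k-th
    fine-tuned classifier's probability that text i exhibits style k.
    """
    spread = style_probs.std(axis=1)   # std across the eight styles
    order = np.argsort(-spread)        # most diverse first
    return [texts[i] for i in order]
```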
We hired 622 workers on Prolific to annotate the texts with human perception labels from November to December 2020. We required the workers to be located in the United States and paid them an average of $9.6/hour. Each worker was asked what styles they perceived each text to exhibit. If they thought a text had certain styles, they then highlighted the words which they believed made them think the text had those styles (pink highlights in Figure 1). Three workers label the same sentence-style pair, and we take a majority vote for the style labels.
[Table 2: Top 5 words where humans and BERT agree or disagree. ↑↑: both humans and BERT give high scores. ↑↓: high human perception score but low BERT importance score. ↓↑: high BERT importance score but low human perception score. BERT-only agreement includes more content words (*) or interjections (#) than human-only agreement.]

Crowd-workers obtained an average percentage agreement of 73.2% on majority labeling at the text level, which is a substantial agreement, as shown in Table 1, and an average percentage agreement of 27.7% at the word level. Then, for a word w_i in a text t = w_1...w_N, the human perception score is defined as the average of the annotators' labels:

s(w_i) = (1/3) Σ_{j=1}^{3} h_j,  h_j ∈ {−1, 0, 1},

where h_j is the score given by the j-th annotator. Each annotator's label contributes a score of 1 for a word perceived as a positive cue, −1 for a negative cue, and 0 otherwise (neutral or no emotion).
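A worked example of this score, assuming the average-over-annotators form reconstructed above (the helper name is ours):

```python
def human_perception_score(h):
    """Average the three annotators' labels (+1 positive cue,
    -1 negative cue, 0 neutral) into a score in [-1, 1]."""
    return sum(h) / len(h)

# Two of three annotators mark "understand" as a positive politeness cue:
human_perception_score([1, 1, 0])   # -> 0.667
```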
BERT's Importance Scores To obtain the word importance (attribution) scores from BERT, we first train BERT-based models, yielding the F1 scores in Table 1. We then use the popular technique of layered integrated gradients (Mudrakarta et al., 2018) provided by Captum (Kokhlikyan et al., 2020). This technique is a variant of integrated gradients, an interpretability algorithm that attributes an importance score to each input feature by approximating the integral of the gradients of the model's output with respect to the inputs along a straight line from given baselines to the inputs (Sundararajan et al., 2017).
Since BERT may tokenize a word w into several word pieces, the importance of a word is the average of the scores of the word pieces that make it up. Given a neural network F : R^n → [0, 1] and an input x = (x_1, ..., x_n) ∈ R^n, the attribution of the prediction at input x relative to a baseline input x′ is a vector A_F(x, x′) = (a_1, ..., a_n) ∈ R^n, where a_i is the attribution of x_i to the prediction F(x). We use Captum's default baseline input x′, which is all zeros. Finally, we obtain an attribution score in [−1, 1] for each token, like the blue highlights in Figure 1.
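For concreteness, here is a minimal sketch of how such attributions can be computed with Captum's LayerIntegratedGradients over BERT's embedding layer. The checkpoint name, target-class index, and normalization are illustrative assumptions, not the paper's exact setup, and the averaging of word-piece scores back to words is omitted:

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

def forward_fn(input_ids):
    # Probability of the "has style" class (index 1 is an assumption).
    return torch.softmax(model(input_ids).logits, dim=-1)[:, 1]

lig = LayerIntegratedGradients(forward_fn, model.bert.embeddings)

def token_importance(text):
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    # Captum falls back to an all-zero baseline when none is given.
    attrs = lig.attribute(inputs=input_ids, n_steps=50)
    scores = attrs.sum(dim=-1).squeeze(0)   # collapse the embedding dim
    scores = scores / torch.norm(scores)    # scale into [-1, 1] (our choice)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    return list(zip(tokens, scores.tolist()))
```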

Human-BERT Agreement through Lexical Analysis
We study how similar human perception and BERT's word importance are, within each style (intra-style) and across styles (multi-style).

Intra-stylistic Analyses
We measure the correlation between human perception of stylistic words and BERT's word importance by computing Pearson's r between them across all words in the vocabulary, as shown in Figure 2. Naïve refers to our baseline, in which we simply count word frequencies in the stylistic texts. For example, if the style is positive sentiment, for a word w we count how many times w appears in sentences labeled as "positive". We then calculate Pearson's r between this word count and the sentences' style labels across all sentences; this Pearson's r score serves as the baseline word importance score for w.
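A sketch of this naïve baseline; the function name is ours and whitespace tokenization is a simplifying assumption:

```python
import numpy as np
from scipy.stats import pearsonr

def naive_word_importance(word, sentences, labels):
    """Naïve baseline: Pearson's r between a word's per-sentence
    frequency and the binary style labels across all sentences."""
    counts = np.array([s.lower().split().count(word) for s in sentences],
                      dtype=float)
    if counts.std() == 0:   # word frequency never varies -> r undefined
        return 0.0
    return pearsonr(counts, np.asarray(labels, dtype=float))[0]
```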
We find that BERT's word importances correlate more highly with human judgments than this baseline; neither BERT nor humans rely purely on co-occurrence frequencies. Some styles are easier to identify for both humans and BERT, such as joy and sentiment, with Pearson's r = 0.288 and 0.273 respectively. The yellow bar suggests that human-BERT agreement is higher when a word appears more often, especially for offensiveness (0.088 vs. 0.224).
We now look into which words BERT and humans agree and disagree on. Table 2 shows such words, selected based on the difference between the word ranks of the human perception scores and those of BERT's word importance. To include only highly stylistic words, words are selected only if their scores are greater than a threshold of 0.3. When humans and BERT agree (↑↑), they attend to words that are clearly associated with the styles (e.g., joy, positive) and are general ("lovely", "delightful", "excited").
In contrast, BERT often finds words that suggest contexts in which the sentiment is likely to occur. For example, the top-5 words from BERT-only agreement (↓↑) contain more content words, such as "scenes" for politeness and "movies" and "baseball" for joy, than those from human-only agreement (↑↓). In particular, we see that for politeness and positive sentiment, BERT pays more attention to interjections (e.g., "hi", "wow") than humans do. For offensiveness and fear in Table 4 in the Appendix, humans perceive hashtags as important cues but BERT does not. Interestingly, humans perceive a seemingly positive word, "charming," as offensive while BERT does not, perhaps missing sarcasm. These content words, and other words irrelevant to the target style, are mostly learned from biases in the training datasets, which can lead to inaccurate machine predictions.
We then evaluate the impact of the important words perceived by humans and BERT on the existing test set using a simple occurrence-based classification method. Given the word lists ranked by human perception score and by BERT's word importance score, we label a text as having the target style if at least one word in the test sentence appears in the top-N word list. For this study, we only select words that appear three times or more in the dataset.
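A minimal sketch of this occurrence-based classifier (names are ours):

```python
def occurrence_classify(test_sentences, top_words):
    """Label a sentence 1 ("has the target style") if any of its
    words appears in the top-N cue-word list, else 0."""
    cue_set = set(top_words)
    return [int(any(w in cue_set for w in s.lower().split()))
            for s in test_sentences]
```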
In Figure 3, humans' word lists outperform BERT's for most styles, even though the annotations are small in number compared with the large original datasets used to train the BERT models. Interestingly, for some negative styles (e.g., impoliteness, negative sentiment, fear), BERT's word lists perform better. We observe that words from the offensive dataset (mostly swear words) are more consistently labeled as impolite and negative by human annotators. However, these words are rarely seen in the original politeness and sentiment datasets, which explains why features from BERT models trained on the original, large datasets achieve higher F1 scores. As for fear, we found that content words such as "facebook" and "theatre" appear in the test data. Here we see that BERT relies on content words (topic-related words) to help predict the style, which is fragile to out-of-domain samples.

Multi-stylistic Analyses
Extending our analyses to multi-style correlations from a lexical viewpoint, we find that humans and machines give similar correlations among the styles. For instance, joy, positive sentiment, and politeness are all positively correlated, as are anger, disgust, and offensiveness (Figure 4). However, the multi-style correlation strength is greater for human perception than for machine importance.
The weaker correlation across styles for machines is confirmed in Figure 5, which presents a lower-dimensional visualization of the stylistic representation of each word. Stylistic words are more clustered in human perception, while for BERT the separation between highly stylistic words and non-stylistic words is less clear. Figure 5 also shows the geometric closeness across the style clusters, giving extra information beyond the pairwise correlations in Figure 4. In human scores, styles cluster into two extremes: politeness, positive sentiment, and joy to the left, and anger, negative sentiment, offensiveness, and impoliteness to the right, with disgust, fear, and sadness between them. This leads to more accurate style correlation analysis than machine-based analysis at the text level (Kang and Hovy, 2021).
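To make the lexical multi-style analysis concrete, the sketch below computes a style-by-style correlation matrix (as in Figure 4) and a 2-D projection of the per-word score vectors (in the spirit of Figure 5). PCA is our stand-in here, since the exact projection method is not specified in the text:

```python
import numpy as np
from sklearn.decomposition import PCA

STYLES = ["politeness", "sentiment", "offensiveness", "anger",
          "disgust", "fear", "joy", "sadness"]

def multi_style_analysis(word_vectors):
    """word_vectors: (num_words, 8) matrix; column k holds each word's
    score (human perception or BERT importance) for STYLES[k].
    Returns the 8x8 Pearson correlation matrix between styles and a
    2-D projection of the words for visualization."""
    corr = np.corrcoef(word_vectors, rowvar=False)  # style-by-style r
    coords = PCA(n_components=2).fit_transform(word_vectors)
    return corr, coords
```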

Conclusion
We showed that BERT's word importances for style prediction, as calculated using integrated gradients, correspond only loosely with the word importances given by human annotators. These differences likely result from several factors: 1) word importances computed for words that appear rarely in the text tend to be noisy; 2) BERT, as a contextual pretrained model, takes more context into account when deciding the style of a text, while humans intuitively choose the most obvious "stylistic" words to judge the style; 3) styles are subjective, so human annotators may perceive the style of a sentence differently.
Future Directions This work also provides a public dataset as a first step for researchers to investigate these issues further. We plan to scale up our data collection in both size and style types, including other higher-level styles such as sarcasm and humor. We also plan to explore the possibility of informing BERT to pay more attention to human-annotated lexica.
[Figure 5: Lower-dimensional visualization of stylistic word representations for humans (top) and the machine (bottom). Each word is represented as a vector of its perception scores for the styles in this order: politeness, sentiment, offensiveness, anger, disgust, fear, joy, and sadness.]

Limitations We acknowledge that while the inter-annotator agreement for sentence-level style is quite high, there is huge variation in the word-level agreement. As a caveat, the annotators could be unreliable. We do find that annotators label different words as being important than those that drive BERT predictions. Note that we do not claim that BERT is "wrong" and humans are "always reliable"; only that they are different. BERT's important words can help the model predict correctly, but they are not necessarily perceived as stylistic features by humans. Studying this difference is the major goal of this paper. We believe that if a word is perceived as "stylistic" by the majority of people, it can be regarded as an important cue for the model. Learning this variability of human perception of styles could be interesting future work using HUMMINGBIRD.

Ethical Considerations
A full analysis of style, such as politeness or the expression of anger, depends upon the context of the utterance: who is saying it to whom in what situation. Such analysis is beyond the scope of this work, which looks only at how the style of an utterance is perceived without context by a small number of crowd workers. Methods such as those we have used here should be extended to look at the more subtle contextual interpretations of style and, eventually, at the ways in which perceived styles may differ from intended styles. Many people have (correctly) drawn attention to the role that (mis)perceptions of style can play in fostering gender or racial discrimination (Kang and Hovy, 2021). Closer attention to the words that drive style perception is an important first step towards addressing such problems.
Commercial platforms such as Crystal, Grammarly, and Textio offer "style checkers". Such software would benefit from analyses that extend the work presented here, in that they could compare the words that human editors suggest indicate a given style to the words that NLP methods select as most important for recognizing different styles. Such comparisons, particularly when contextualized, should allow construction of better software to help writers control the effect their writing has on the people reading it.

A Existing Datasets for Style Classification
We use existing style datasets: StanfordPoliteness (Danescu-Niculescu-Mizil et al., 2013) for politeness, Sentiment TreeBank (Socher et al., 2013) for sentiment, Davidson et al. (2017)'s dataset for offensiveness, and SemEval 2018 Task 1: Affect in Tweets (Mohammad et al., 2018) for emotion classification. We convert non-binary labels or scores to binary labels to standardize the multi-style analysis, resulting in eight styles. Table 3 shows the dataset sizes and train/dev/test splits. StanfordPoliteness is collected from StackExchange and Wikipedia requests. Its labels are continuous values in [−2, 2], so we convert them to binary labels by treating all values greater than 0 as "polite" and the rest as "impolite". The Sentiment TreeBank dataset consists of movie review texts, and we only use the coarse "positive" and "negative" labels for training. Davidson et al. (2017) collected their data from Twitter, and we only consider the "offensive" and "none" labels. The SemEval 2018 dataset is collected from tweets and has a total of 11 emotions for the same 10.9k instances: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust. We select anger, disgust, fear, joy, and sadness since these emotions have the highest F1 scores compared to the rest. Each emotion has two labels: "anger" or "not anger", "disgust" or "not disgust", and so on.
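As an example, the politeness binarization rule can be written as a one-line sketch (the function name is ours):

```python
def binarize_politeness(score):
    """StanfordPoliteness scores are continuous in [-2, 2]; values
    above 0 map to "polite", the rest to "impolite"."""
    return "polite" if score > 0 else "impolite"
```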

B Training Configuration
We use the lower-cased BERT-base model with 12 hidden layers, 12 attention heads, and hidden size 768 for training our style classifiers on a GeForce GTX TITAN X GPU. The dropout rate is 0.1, the learning rate is 2 × 10^−5, and the optimizer is AdamW (Loshchilov and Hutter, 2017). The vocabulary size is 30,522 and the maximum position embeddings is 512. Training ran for
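A sketch of this configuration using the Hugging Face transformers library; data loading, the training loop, and the epoch count (which is cut off above) are omitted:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",       # 12 layers, 12 heads, hidden size 768,
    num_labels=2,              # 30,522-token vocab, 512 max positions
    hidden_dropout_prob=0.1,   # dropout rate from the text
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```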

C Annotation Interface
For each text-style pair (total: 500 texts × 8 styles = 4,000 pairs), we ask three different annotators to select the style label for the text and highlight the words which make them think the text has that style, with instructions shown in Figure 6. To guarantee that the workers take the task seriously, we provide a screening practice session which resembles the exact task but with a text that is very obvious to annotate, as shown in Figure 7. The real task interface is also the same as

[Table 4: Top 10 words where humans and BERT agree and disagree for all eight styles. We only select words that appear ≥ 2 times. ↑↑: both humans and BERT give high scores. ↑↓: high human perception score but low word importance score. ↓↑: high word importance score but low human perception score.]

D Important Words Perceived by Humans and the Machine
Table 4 shows the top 10 words where humans and BERT agree and disagree for all eight styles.