Can BERT eat RuCoLA? Topological Data Analysis to Explain

This paper investigates how Transformer language models (LMs) fine-tuned for acceptability classification capture linguistic features. Our approach is based on best practices of topological data analysis (TDA) in NLP: we construct directed attention graphs from attention matrices, derive topological features from them, and feed them to linear classifiers. We introduce two novel features, chordality and the matching number, and show that TDA-based classifiers outperform fine-tuning baselines. We experiment with two datasets, CoLA and RuCoLA, in English and Russian, two typologically different languages. On top of that, we propose several black-box introspection techniques aimed at detecting changes in the attention mode of LMs during fine-tuning, defining the LM's prediction confidence, and associating individual heads with fine-grained grammatical phenomena. Our results contribute to understanding the behaviour of monolingual LMs in the acceptability classification task, provide insights into the functional roles of attention heads, and highlight the advantages of TDA-based approaches for analyzing LMs. We release the code and the experimental results for further uptake (https://github.com/upunaprosk/la-tda).


Introduction
Language modelling with Transformers (Vaswani et al., 2017) has become a standard approach to acceptability judgements, providing results on par with the human baseline (Warstadt et al., 2019). Pre-trained encoders, and BERT in particular, were proven to have an advantage over other models, especially when judging the acceptability of sentences with long-distance dependencies (Warstadt and Bowman, 2019). Research examining the linguistic knowledge of BERT-based language models (LMs) revealed that: (1) individual attention heads can store syntactic, semantic, or both kinds of linguistic information (Jo and Myaeng, 2020; Clark et al., 2019), (2) vertical, diagonal, and block attention patterns frequently repeat across layers (Kovaleva et al., 2019), and (3) fine-tuning affects how linguistic features are encoded, tending to lose some of the pre-trained model's knowledge (Miaschi et al., 2020). However, less attention has been paid to examining the grammatical knowledge of LMs in languages other than English. Existing work on cross-lingual probing showed that the grammatical knowledge of Transformer LMs adapts to the downstream language; in the case of Russian, the results cannot be easily interpreted (Ravishankar et al., 2019). Moreover, LMs are less sensitive to granular perturbations when processing texts in languages with free word order, such as Russian (Taktasheva et al., 2021).
In this paper, we probe the linguistic features captured by Transformer LMs fine-tuned for acceptability classification in Russian. Following recent advances in acceptability classification, we use the Russian Corpus of Linguistic Acceptability (RUCOLA; 'rucola' is arugula, or rocket salad, in English) (Mikhailov et al., 2022), covering tense and word-order violations, errors in the construction of subordinate clauses, indefinite pronoun usage, and other related grammatical phenomena. We provide an example of an unacceptable sentence from RUCOLA with a morphological violation in pronoun usage: a possessive reflexive pronoun 'svoj' (oneself's/own) instead of the 3rd person pronoun.
(1) * Eto byl pervyj chempionat mira v svoej kar'ere. ("It was the first world championship in own career.")

Following the recently proposed Topological Data Analysis (TDA) based approach to the linguistic acceptability (LA) task (Cherniavskii et al., 2022), we construct directed attention graphs from attention matrices and then treat the characteristics of these graphs as the linguistic features learnt by the model. We extend the existing research on acceptability classification to the Russian language and show the advantages of the TDA-based approach to the task. Our main contributions are the following: (i) we investigate the monolingual behaviour of LMs in acceptability classification tasks in Russian and English using a TDA-based approach, (ii) we introduce new topological features and outperform previously established baselines, (iii) we suggest a new TDA-based approach for measuring the distance between pre-trained and fine-tuned LMs with large and base configurations, and (iv) we determine the roles of attention heads in the context of LA tasks in Russian and English.
Our initial hypothesis is that there is a difference in the structure of attention graphs between the languages, especially for sentences with morphological, syntactic, and semantic violations. We analyze the relationship between models by comparing the features of their attention graphs. To the best of our knowledge, our research is one of the first attempts to analyse the differences between monolingual LMs fine-tuned on acceptability classification corpora in Russian and English using a TDA-based approach.

Related Work
Acceptability Classification. Early studies performed acceptability classification with statistical machine learning methods, rule-based systems, and context-free grammars (Cherry and Quirk, 2008; Wagner et al., 2009; Post, 2011). Alternative approaches use threshold scoring functions to estimate the likelihood of a sentence (Lau et al., 2020). Recent research has centered on the ability of omnipresent Transformer LMs to judge acceptability (Wang et al., 2018), on probing their grammar acquisition (Zhang et al., 2021), and on evaluating semantic correctness in language generation (Batra et al., 2021). In this project, we develop acceptability classification methods and apply them to datasets in two different languages, English and Russian.
Topological Data Analysis (TDA) in NLP. Recent work uses TDA to explore the inner workings of LMs. Kushnareva et al. (2021) derive TDA features from attention maps for artificial text detection. Colombo et al. (2021) introduce BARYSCORE, an automatic evaluation metric for text generation that relies on Wasserstein distance and barycenters. Chauhan and Kaul (2022) develop a scoring function that captures the homology of high-dimensional hidden representations and is aimed at test accuracy prediction. We extend the set of persistent features proposed by Cherniavskii et al. (2022) for acceptability classification and conduct an extensive analysis of how the persistent features contribute to the classifier's performance.
How do LMs change via fine-tuning? There have been two streams of studies on how fine-tuning affects the inner workings of LMs: (i) what do subword representations capture, and (ii) what are the functional roles of attention heads? The experimental techniques include similarity analysis between the weights of source and fine-tuned checkpoints (Clark et al., 2019), training probing classifiers (Durrani et al., 2021), computing feature importance scores (Atanasova et al., 2020), and the dimensionality reduction of subword representations (Alammar, 2021). The findings help to improve fine-tuning procedures by modifying loss functions (Elazar et al., 2021) and provide techniques for explaining LMs' predictions (Danilevsky et al., 2020). Our approach reveals the linguistic competence of attention heads by associating head-specific persistent features with fine-grained linguistic phenomena.

Methodology
We follow Warstadt et al. (2019) and treat the LA task as a supervised classification problem. We fine-tune Transformer LMs to approximate the function that maps an input sentence to a target class: acceptable or unacceptable.

Extracted Features
Given an input text, we extract the output attention matrices from Transformer LMs and follow Kushnareva et al. (2021) to compute three types of persistent features over them.
Topological features are properties of attention graphs. We provide an example of an attention graph constructed from an attention matrix in Figure 1. The adjacency matrix of the attention graph A = (a_ij)_{n×n} is obtained from the attention matrix W = (w_ij)_{n×n} using a pre-defined threshold thr:

a_ij = 1 if w_ij ≥ thr, and a_ij = 0 otherwise,

where w_ij is the attention weight between tokens i and j, and n is the number of tokens in the input sequence. Each token corresponds to a graph node. Features of directed attention graphs include the number of strongly connected components, edges, simple cycles, and the average vertex degree. The properties of undirected graphs include the first two Betti numbers: the number of connected components and the number of simple cycles. We propose two new features of the undirected attention graphs: the matching number and the chordality. The matching number is the size of a maximum matching in the graph, i.e. the largest possible set of edges with no common nodes. Consider the attention matrix depicted in Figure 1a and the simple undirected attention graph (Figure 1c) constructed from the bipartite graph (Figure 1b) with a threshold of 0.1. The matching number of that attention graph is equal to two. One example of a maximum matching in that graph is the set of edges {(John - sang), ([SEP] - [CLS])}. That matching is maximum because no remaining edge avoids the four already matched nodes (tokens). The chordality is a binary feature showing whether the attention graph is chordal, that is, whether the attention graph contains no induced cycles of length greater than 3. For example, the graph plotted in Figure 1c is chordal because it does not contain induced cycles with more than 3 edges. If there were no dotted edges (chords) in the graph, there would be a cycle [SEP]-beautifully-sang-[CLS]-[SEP] of length 4, and the graph would not be chordal.
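A minimal sketch of the attention-graph construction and the two novel features using networkx; the threshold comparison (w_ij ≥ thr) and the undirected-graph form are assumptions consistent with the description above:

```python
import networkx as nx
import numpy as np

def attention_graph(W, thr):
    """Build an undirected attention graph: keep edge (i, j) iff w_ij >= thr."""
    n = W.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(n):
            if i != j and W[i, j] >= thr:
                G.add_edge(i, j)
    return G

def matching_number(G):
    """Size of a maximum matching: the largest edge set with no shared nodes."""
    return len(nx.max_weight_matching(G, maxcardinality=True))

def is_chordal(G):
    """True iff the graph has no induced cycle of length greater than 3."""
    return nx.is_chordal(G)
```

For example, a symmetric 4-token matrix whose above-threshold edges form a 4-cycle has matching number 2 and is not chordal; raising the threshold until only a 2-edge path survives gives matching number 1 and a chordal graph.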
We expect these novel features to express syntactic phenomena of the input text. The chordality feature could carry information about subject-verb-object triplets. The maximum matching may correspond to matched sentence segments (subordinate clauses, adverbials, participles, introductory phrases, etc.).
Features derived from barcodes include descriptive characteristics of 0/1-dimensional barcodes and reflect the birth and death of connected components and edges throughout the filtration.
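A 0-dimensional barcode over the attention-graph filtration can be sketched with a simple union-find pass; the filtration convention below (edges entering as the threshold sweeps down, parameterized by 1 − w) is an assumption, not the paper's exact implementation:

```python
import numpy as np

def zero_dim_barcode(W):
    """0-dimensional barcode of the attention-graph filtration.

    Edges enter as the threshold sweeps down from 1 to 0; we use 1 - w as
    the filtration parameter. Every component is born at 0 and dies when
    it merges into another component; survivors have infinite persistence.
    """
    n = W.shape[0]
    # Symmetrize and sort undirected edges by decreasing weight.
    S = np.maximum(W, W.T)
    edges = sorted(((S[i, j], i, j) for i in range(n) for j in range(i + 1, n)
                    if S[i, j] > 0), reverse=True)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    bars = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            bars.append((0.0, 1.0 - w))    # a component dies at 1 - w
    roots = {find(i) for i in range(n)}
    bars += [(0.0, float('inf'))] * len(roots)  # essential bars
    return bars
```

Descriptive statistics of these bars (e.g. the sum or mean of finite bar lengths, the number of bars) then serve as features.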
Distance-to-pattern features measure the distance between attention matrices and pre-defined attention pattern matrices, such as attention to the first token [CLS] and to the last token [SEP] of the sequence, attention to the previous and next tokens, and attention to punctuation marks (Clark et al., 2019). We use a publicly available implementation to compute the features.
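These features can be sketched as Frobenius distances to binary template matrices; the exact templates and the choice of metric here are assumptions consistent with the description:

```python
import numpy as np

def pattern_matrices(n):
    """Binary template matrices for common attention patterns (assumed forms)."""
    to_first = np.zeros((n, n)); to_first[:, 0] = 1.0   # attend to [CLS]
    to_last = np.zeros((n, n)); to_last[:, -1] = 1.0    # attend to [SEP]
    to_prev = np.eye(n, k=-1)                           # attend to previous token
    to_next = np.eye(n, k=1)                            # attend to next token
    return {"cls": to_first, "sep": to_last, "prev": to_prev, "next": to_next}

def distance_to_patterns(W):
    """Frobenius distance from an attention matrix to each pattern template."""
    return {name: float(np.linalg.norm(W - P))
            for name, P in pattern_matrices(W.shape[0]).items()}
```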

Experimental Framework
Data. We use two publicly available LA benchmarks in two typologically different languages: Russian (RUCOLA; Mikhailov et al., 2022) and English (COLA; Warstadt et al., 2019). Both corpora consist of in- and out-of-domain data and contain sentences collected from linguistics publications; each sentence is marked as acceptable or unacceptable. Unacceptable sentences are annotated with the syntactic, morphological, and semantic phenomena violated in them. RUCOLA, in addition, covers synthetic data generated by generative LMs. We provide examples of acceptable sentences from the corpora (2a, 3a) along with sentences with semantic violations (2b, 3b).
(2) a. The dog bit the cat.
We estimate the change in attention weights between the frozen and fine-tuned LMs as the Jensen-Shannon (JS) divergence averaged over sentences, heads, and tokens:

D_JS(M_t, M_0) = (1 / (N H)) Σ_{n=1}^{N} Σ_{h=1}^{H} (1 / K) Σ_{i=1}^{K} JS( W_t^h(token_i), W_0^h(token_i) ),

where M_t and M_0 are the fine-tuned and frozen models respectively, N is the number of sentences, H is the number of attention heads (H = 12 for base-configuration LMs, H = 24 for large LMs), K is the number of tokens in sentence n, and W_t^h(token_i) is the attention weight distribution of attention head h at token i in model M_t.
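The JS-based attention distance described here can be sketched as follows; the tensor shapes and the averaging scheme are assumptions consistent with the definitions above:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_attention_distance(attn_t, attn_0):
    """Average Jensen-Shannon divergence between the attention distributions
    of a fine-tuned (attn_t) and a frozen (attn_0) model.

    attn_*: arrays of shape (N, H, K, K) -- sentences, heads, tokens, tokens;
    each row attn[n, h, i] is a probability distribution over the K tokens.
    """
    N, H, K, _ = attn_t.shape
    total = 0.0
    for n in range(N):
        for h in range(H):
            for i in range(K):
                # scipy returns the JS *distance* (sqrt of the divergence),
                # so we square it to recover the divergence.
                total += jensenshannon(attn_t[n, h, i], attn_0[n, h, i]) ** 2
    return total / (N * H * K)
```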
Second, we estimate the difference between attention graphs as the average correlation distance between the TDA_ext features across attention heads:

D_feat(M_t, M_0) = (1 / H) Σ_{h=1}^{H} d_corr( (V_{t1}^h, ..., V_{tF}^h), (V_{01}^h, ..., V_{0F}^h) ),

where d_corr is the correlation distance, F is the number of features, and V_{tf}^h is the value of feature f computed over the attention matrix W_t^h extracted from model M_t.

The MCC score gain is +0.252 at most for the Russian LMs, with a more substantial +0.504 increase for En-BERT. The proposed chordality and matching number features are beneficial and improve performance, showing that they capture linguistic information. Unlike the base LMs, large frozen LMs exhibit grammatical knowledge even before fine-tuning: the base LMs' MCC scores fluctuate around zero, while the large LMs achieve at least 0.3 MCC.
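A minimal sketch of this correlation-based feature distance, using scipy's correlation distance (d_corr = 1 − Pearson r); the per-head averaging is taken from the definition above, while the exact normalization is an assumption:

```python
import numpy as np
from scipy.spatial.distance import correlation

def feature_distance(feats_t, feats_0):
    """Average correlation distance between the TDA feature vectors of the
    fine-tuned and frozen models, taken across attention heads.

    feats_*: arrays of shape (H, F) -- one feature vector per head.
    """
    H = feats_t.shape[0]
    return sum(correlation(feats_t[h], feats_0[h]) for h in range(H)) / H
```

Identical (non-constant) feature vectors give distance 0; perfectly anti-correlated vectors give the maximum distance of 2.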

Acceptability Classification
That observation aligns with recent work showing that pre-trained large En-RoBERTa can achieve competitive scores without further fine-tuning in tasks such as lexical complexity prediction (Rao et al., 2021).
At the same time, TDA classifiers outperform fine-tuned models by a minor margin, enhancing scores by at best +0.064 MCC for Russian and +0.092 MCC for English. We believe that fine-tuning may cause the LM to lose general grammatical skills and forget language phenomena that are not present in the fine-tuning set (Miaschi et al., 2020). Thus, the features extracted from the fine-tuned models may require thorough feature selection with non-linear models to mitigate feature redundancy. TDA classifiers for RUCOLA achieve scores on par with the baseline LMs. However, for COLA, the TDA_ext classifier coupled with En-RoBERTa outperforms the baseline. We report classification results on OOD test data in Table 7 and Table 8, Appendix B.1.

Sensitivity to Violation Categories
Next, we analyze the gains in recall of TDA classifiers with respect to the violation category. Table 2 reports the scores of the Ru-BERT and En-BERT baselines and TDA classifiers, averaged between the IDD and OODD sets, with respect to five grammatical violation categories. TDA classifiers outperform LMs on unacceptable sentences; that uptrend holds for both languages, while there is a drop on acceptable sentences.
In contrast to English, the TDA_ext classifier trained on Ru-BERT features is more sensitive to syntactic violations, reaching an overall recall of 76.6; that is an increase of around 20 recall points compared to the fine-tuned Ru-BERT. In the remaining grammar categories, the TDA_ext classifier outperforms the fine-tuned Ru-BERT by a large margin, especially on sentences with word-level morphological violations, where the recall of Ru-BERT is more than doubled.
Next, we manually analyze the errors of the fine-tuned Ru-BERT and our TDA_ext classifier on OODD sentences in Russian. First, we compare the unacceptable sentences misclassified by Ru-BERT but correctly classified by the TDA_ext classifier. We find that the error span in these OODD sentences is relatively short, at most three tokens. In particular, these sentences most often contain violations such as non-existing words, the misuse of which is quite common among native speakers (4a, word formation error 'ekhaj'), local inverse word order (4b), or nonsense (4c). Common false predictions of both models include long sentences that mix grammatical phenomena, contain long-distance agreement violations, and have complex punctuation errors.

(4) b. ("There are in this forest wolves.")
c. * Oni chitali moi zhaloby na sebya. ("They read my complaints onto themselves.")

The domain shift from ID to OOD introduces new types of unacceptable phenomena that are not present in the ID data. Overall, the scores for OOD data are lower than for ID data (Table 2; Table 9, Appendix B.1).

Fine-tuning Effect
We investigate the dynamics of LM fine-tuning and measure the per-layer distance between the TDA_ext features extracted from frozen and fine-tuned LMs on the OODD subsets (§3.3). Figure 2 illustrates the layer-wise feature distance and JS divergence for Ru-BERT and En-BERT (see Figure 3, Appendix B.2 for the large models). Overall, we find that the distance between features rises steadily from the bottom to the higher layers, whilst for the English LMs the most noticeable changes occur only in the last four layers. This observation implies a noticeable difference in fine-tuning dynamics between En-BERT and Ru-BERT.
For both languages, the feature distance trend differs from the JS divergence, especially in the first six layers. This indicates that the TDA_ext features can detect minor changes in the lower layers that are poorly expressed by the JS divergence. For example, the TDA-based distance is sensitive to small changes in attention weights at lower pre-defined thresholds, where the large attention weights remain unchanged; the JS divergence cannot capture such cases.
The distance between features is uniform with respect to the violation category: the trends for acceptable and unacceptable sentences almost coincide, although there are noticeable differences in the JS divergence. For the Russian models, the JS divergence in sentences with syntactic violations and hallucinations is more pronounced in the higher layers compared to other categories. In turn, the JS divergence for English shows that the attention mode is more consistent with the frozen En-BERT on sentences with semantic and syntactic violations; for acceptable and other sentences, the peak is reached at the penultimate layer. Similarly to the LMs with the base configuration, there is a steady increase in feature dissimilarity across all the layers, while for English the main changes appear in the higher layers.

Head Importance
We probe linguistic phenomena with the help of persistent features: we exploit the learnt feature weights in the linear classifiers (Appendix B.3).The higher the weight of the feature, the more it contributes to the final prediction.We aggregate

Table 3: Examples of the most important Ru-BERT TDA_ext features for judging RUCOLA unacceptable sentences by error type; e.g., the distance-to-[CLS]-token feature of head (11, 0) for the sentence glossed as "The store closing at two o'clock yesterday." c = the number of simple cycles in a graph; thr = the threshold used for constructing the attention graph; [CLS] = the distance-to-[CLS]-token feature.
features derived from each head: the importance of a head is defined as the number of important features it houses. We define two types of heads: (1) heads that contribute the most to true positive and true negative predictions (i.e. correct predictions), dubbed agreeing heads, and (2) heads that contribute the most to false negative and false positive predictions (i.e. the classifier's errors), dubbed disagreeing heads. First, we explore the importance of each individual head. Figure 4 (Appendix B.4) shows how important each head is for the final prediction. En-BERT and Ru-BERT have similar patterns for the heads of type (1): the most useful features for Ru-BERT are housed in the middle-to-higher layers, while for En-BERT they tend to be localized mostly in the last two layers.
Next, we compute the feature importance with respect to the violation category. Heads in the middle layers contribute more to detecting syntactic and morphological violations in English and Russian. Heads of type (2) do not overlap with heads of type (1), with a few exceptions: head 10 and head 0 from the last layer of Ru-BERT and En-BERT, respectively. Judging by the number of type (2) heads, Ru-BERT struggles the most to distinguish sentences with hallucinations from acceptable sentences. This might be due to multiple reasons: (i) hallucinated sentences are not seen during training, and (ii) hallucinated sentences are mostly well-formed but semantically incorrect, so there are no surface or syntactic clues to rely on.
Next, we determine the set of sentences that are the most challenging for the TDA classifier and, thus, for the corresponding LM, since the TDA features are extracted from its attention maps. To do so, we define the LM's confidence as the sum of absolute feature weights for predicting the acceptable and unacceptable classes. The lower the score, the more confused the LM is and the more the attention heads tend to disagree with the desired prediction. We consider challenging those sentences that obtain the lowest confidence scores. The most challenging sentences are long, consist of multiple clauses, and contain terms or named entities; see the unacceptable sentence (5) for an example. For the sake of completeness, we conduct the same analysis for COLA sentences and provide an example of the most confusing sentence for the TDA_ext classifier (6). The results align well: the most challenging sentences contain long-distance dependencies and named entities.
(5) * ("This group found (poorly) that the northern watershed of the Merrimack was near what is now known as Lake Vinnipesaukee in New Gampshire.")

(6) * Gould's performance of Bach on the piano doesn't please me anywhere as much as Ross's on the harpsichord.
Finally, we explore the feature contribution at the sentence level. Our TDA-based approach allows explaining the prediction for every single sentence. To this end, the contribution (importance) of each feature is the feature value multiplied by the learnt weight of the linear classifier. We observe the following patterns across unacceptable sentences in Russian and Ru-BERT: (1) distance-to-pattern features appear to be useful for classifying unacceptable sentences with word-level violations, including spelling, punctuation, and agreement errors; (2) topological features and features derived from barcodes contribute equally to more complicated grammatical phenomena.
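The per-sentence contribution scheme (feature value × learnt linear weight) and a simple confidence proxy can be sketched with scikit-learn; the toy feature matrix below is hypothetical illustration data, not the paper's features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: rows = sentences, columns = persistent features
# (e.g. matching number, chordality, cycle count for some head).
X = np.array([[2.0, 1.0, 0.0],
              [3.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [4.0, 0.0, 1.0]])
y = np.array([0, 1, 0, 1])  # 0 = acceptable, 1 = unacceptable

clf = LogisticRegression().fit(X, y)

def feature_contributions(x):
    """Per-feature contribution to one prediction: value * learnt weight."""
    return x * clf.coef_[0]

def confidence(x):
    """A simple confidence proxy: the magnitude of the decision score,
    i.e. the absolute sum of signed feature contributions plus the bias."""
    return abs(feature_contributions(x).sum() + clf.intercept_[0])
```

By construction, the contributions plus the bias reproduce the classifier's decision score, so each feature's share of a prediction can be read off directly.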
Table 3 provides examples of unacceptable sentences along with their feature importance values. Chordality, the matching number, the number of simple cycles, and the average vertex degree, derived at thresholds 0.1 or 0.25, frequently become important for predicting unacceptable sentences in Russian. Similarly, the average vertex degree has the most discriminative power for English and En-BERT. Important features are housed across different layers of the LMs. For English, the most important features are extracted from the last layer, while for Russian they appear at the earliest at layer 6.
However, when it comes to the discrepancy in attention graphs between acceptable and unacceptable sentences, we find the following pattern common to both languages. The number of connected components in the attention graphs of unacceptable sentences is larger at the lowest and the highest thresholds. At the highest threshold, these components consist of a single token; at the lowest, of a few tokens. This means that the attention values in unacceptable sentences do not deviate much from each other. In contrast, for acceptable sentences there is a tendency to put most of the attention weight on a single token, which is usually the sentence's head verb. In terms of the TDA feature values, this effect is visible in the sign of the correlation coefficient between the feature value and the target class: there is a clear shift towards positive correlation at a threshold of 0.5 for the average vertex degree features (Figure 5).
To sum up, such an analysis helps to better explain the classifiers' predictions. Since persistent features are attributed to individual heads, we can trace the role and importance of each head. The fine-grained annotation of language phenomena allows us to associate specific linguistic skills with individual heads.

Conclusion
In this paper, we adopt and improve methods for acceptability classification using best practices from topological data analysis (TDA). We showcase the developed methods in two typologically different languages, using datasets in English and Russian, COLA and RUCOLA, respectively. In particular, we introduce two novel features, chordality and the matching number, and compare the performance of TDA-based classifiers to fine-tuning. TDA-based classifiers boost the performance of pre-trained language models.
TDA-based classifiers have advantages over LM fine-tuning because they are more interpretable and help introspect the inner workings of LMs. To this end, we introduce a TDA feature-based distance measure to detect changes in the attention mode of LMs during fine-tuning. This distance measure is sensitive even to small changes occurring at the bottom layers of LMs that are not detected by the widespread Jensen-Shannon divergence. More importantly, we show how TDA features reveal the functional roles of attention heads. We compare heads that contribute to correct and incorrect predictions based on their importance. This way, we discover heads that store information about word order, word derivation, and complex semantic phenomena in unacceptable sentences, and heads that attend to acceptable sentences.
Given a sentence, we evaluate the prediction confidence based on the contributions of the features. We determine the set of sentences on which LMs are less confident and find that those sentences usually consist of multiple clauses and frequently include named entities. Finally, we find a distinct pattern that is frequently present in the attention maps of unacceptable sentences in English and Russian.
We hope that our results shed light on the performance of LMs in Russian and English and help in understanding their fine-tuning dynamics and the functional roles of attention heads. We are excited to see NLP practitioners adopt TDA for other languages and downstream problems.

Limitations
Acceptability judgments datasets. Acceptability judgments datasets use linguistic literature as a source of unacceptable sentences. Such an approach is subject to criticism on two counts: (i) the reliability and reproducibility of acceptability judgments (Gibson and Fedorenko, 2013; Culicover and Jackendoff, 2010; Sprouse and Almeida, 2013; Linzen and Oseki, 2018), and (ii) representativeness, as linguists' judgments may not reflect the errors that speakers tend to produce (Dąbrowska, 2010).

Computational complexity
The computational complexity of the proposed features is linear. For the chordality feature, we rely on an implementation of the linear-time O(|E| + |V|) algorithm (Tarjan and Yannakakis, 1984), where |E| and |V| are the numbers of edges and nodes, respectively. We use a greedy algorithm with linear complexity O(|E|) to find a maximal matching, which approximates the maximum matching. When calculating simple cycles with the (worst-case) exponential-time algorithm, we stop early once a limit of 500 cycles is reached, assuming that simple-cycle features are less informative beyond that value. Kushnareva et al. (2021) discuss the time complexity of the remaining features.

Figure 1 :
Figure 1: An example of an attention map (a) and the corresponding bipartite (b) and attention (c) graphs for the COLA sentence "John sang beautifully". The graphs are constructed with a threshold of 0.1.

Figure 2 :
Figure 2: Per-layer feature distance and JS divergence of attention scores between the frozen and fine-tuned Ru-BERT and En-BERT.

Table 4 (Appendix A) reports statistics of the corpora used. For per-category evaluation, we use the RUCOLA error annotations; for COLA, we use the minor grammatical phenomena annotations to group erroneous sentences. We provide more details in Table 5 (Appendix A).

Fine-tuning Effect. We estimate changes in attention weights between pre-trained and fine-tuned LMs with two methods. First, we follow Hao et al. (2020) and employ the Jensen-Shannon (JS) divergence between the attention distributions of the two models.

Model architectures, fine-tuning and evaluation scripts are taken from the Transformers library (Wolf et al., 2020). We use the following case-sensitive monolingual Transformer LMs for the experiments: (1) base-size En-BERT (Devlin et al., 2019) and Ru-BERT, and (2) large-size En-RoBERTa (Liu et al., 2019) and Ru-RoBERTa. To estimate the effect of fine-tuning, we compare two types of models: pre-trained LMs with frozen weights (frozen) and LMs fine-tuned on the training sets. Transformer LMs are fine-tuned for 5 epochs on in-domain training data, with a batch size of 32 and an optimal set of hyper-parameters determined by the authors of the datasets. To mitigate class imbalance, we use a weighted cross-entropy loss. We provide fine-tuning details in Table 6 (Appendix A).
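The class-weighted cross-entropy loss mentioned above can be sketched in numpy; this is a generic weighted formulation, and the exact weighting scheme used in the paper is an assumption:

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy for binary acceptability labels.

    logits: (N, 2) raw scores, labels: (N,) ints, class_weights: (2,) --
    e.g. inverse class frequencies to counter class imbalance.
    """
    z = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = class_weights[labels]                            # per-sample weights
    return -(w * log_probs[np.arange(len(labels)), labels]).sum() / w.sum()
```

With equal class weights this reduces to the ordinary mean cross-entropy; upweighting the minority class increases the penalty for its misclassified examples.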

Table 1 reports the LA classification results. Linear classifiers trained on the TDA features boost Transformer LM performance; that trend is consistent across all models.

Table 1 :
Acceptability classification results of monolingual LMs and linear classifiers trained on the sets of features, by benchmark. IDD = in-domain development set. OODD = out-of-domain development set. TDA_ext = TDA features + chordality and the matching number. The best score is in bold; the second-best is underlined.

Hence, LMs do not generalize well to unseen unacceptable phenomena and have little knowledge about unseen linguistic properties.

Table 2 :
Overall per-category recall by the benchmark.