Acceptability Judgements via Examining the Topology of Attention Maps

The role of the attention mechanism in encoding linguistic knowledge has received special interest in NLP. However, the ability of the attention heads to judge the grammatical acceptability of a sentence has been underexplored. This paper approaches the paradigm of acceptability judgments with topological data analysis (TDA), showing that the geometric properties of the attention graph can be efficiently exploited for two standard practices in linguistics: binary judgments and linguistic minimal pairs. Topological features enhance the BERT-based acceptability classifier scores by $8$%-$24$% on CoLA in three languages (English, Italian, and Swedish). By revealing the topological discrepancy between attention maps of minimal pairs, we achieve human-level performance on the BLiMP benchmark, outperforming nine statistical and Transformer LM baselines. At the same time, TDA provides the foundation for analyzing the linguistic functions of attention heads and interpreting the correspondence between the graph features and grammatical phenomena.


Introduction
Linguistic competence of neural language models (LMs) has emerged as one of the core sub-fields in NLP. The research paradigms explore whether Transformer LMs (Vaswani et al., 2017) induce linguistic generalizations from raw pre-training corpora (Warstadt et al., 2020b; Zhang et al., 2021), what properties are learned during task-specific fine-tuning (Miaschi et al., 2020; Merchant et al., 2020), and how the experimental results are connected to grammar and language acquisition theories (Pater, 2019; Manning et al., 2020).
One of these paradigms is centered around acceptability judgments, which have formed an empirical foundation in generative linguistics over the last six decades (Chomsky, 1965; Schütze, 1996; Scholz et al., 2021). Acceptability of linguistic stimuli is traditionally investigated in the form of a forced choice between binary categories or minimal pairs (Sprouse, 2018), which are widely adopted for acceptability classification (Linzen et al., 2016; Warstadt et al., 2019) and probabilistic LM scoring (Lau et al., 2017).
A range of approaches has been proposed to interpret the roles of hundreds of attention heads in encoding linguistic properties (Htut et al., 2019; Wu et al., 2020) and to identify how the most influential ones benefit the downstream performance (Voita et al., 2019; Jo and Myaeng, 2020). Prior work has demonstrated that heads induce grammar formalisms and structural knowledge (Zhou and Zhao, 2019; Lin et al., 2019; Luo, 2021), and that linguistic features motivate attention patterns (Kovaleva et al., 2019; Clark et al., 2019). Recent studies also show that certain heads can have multiple functional roles (Pande et al., 2021) and even perform syntactic functions for typologically distant languages (Ravishankar et al., 2021).
Our paper presents one of the first attempts to analyze attention heads in the context of linguistic acceptability (LA) using topological data analysis (TDA; Chazal and Michel, 2017). TDA allows for exploiting complex structures underlying textual data and investigating graph representations of the Transformer's attention maps. We show that topological features are sensitive to well-established LA contrasts, and that grammatical phenomena can be encoded with the topological properties of the attention map.
The main contributions are the following: (i) We adapt TDA methods to two standard approaches to LA judgments: acceptability classification and scoring minimal pairs (§3). (ii) We conduct acceptability classification experiments in three Indo-European languages (English, Italian, and Swedish) and outperform the established baselines (§4). (iii) We introduce two scoring functions, which reach human-level performance in discriminating between minimal pairs in English and surpass nine statistical and Transformer LM baselines (§5). (iv) The linguistic analysis of the feature space proves that TDA can serve as a complementary approach to interpreting the attention mechanism and identifying heads with linguistic functions (§4.3, §5.3, §6).
Related Work

Linguistic Acceptability

Acceptability Classification. Early works approach acceptability classification with classic ML methods, hand-crafted feature templates, and probabilistic syntax parsers (Cherry and Quirk, 2008; Wagner et al., 2009; Post, 2011). Another line employs statistical LMs (Heilman et al., 2014), including threshold-based classification with LM scoring functions (Clark et al., 2013). The ability of RNN-based models (Elman, 1990; Hochreiter and Schmidhuber, 1997) to capture long-distance regularities has stimulated investigation of their grammatical sensitivity (Linzen et al., 2016). With the release of the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2019) and advances in language modeling, the focus has shifted towards Transformer LMs (Yin et al., 2020), establishing LA as a proxy for natural language understanding (NLU) abilities (Wang et al., 2018) and the linguistic competence of LMs (Warstadt and Bowman, 2019).

Linguistic Minimal Pairs. A forced choice between minimal pairs is a complementary approach to LA, which evaluates preferences between pairs of sentences that contrast an isolated grammatical phenomenon (Schütze, 1996). The idea of discriminating between minimal contrastive pairs has been widely applied to scoring generated hypotheses in downstream tasks (Pauls and Klein, 2012; Salazar et al., 2020), measuring social biases (Nangia et al., 2020), analyzing machine translation models (Burlot and Yvon, 2017; Sennrich, 2017), and linguistic profiling of LMs in multiple languages (Marvin and Linzen, 2018; Mueller et al., 2020).

Topological Data Analysis in NLP
TDA has found several applications in NLP. One of them is word sense induction by clustering word graphs and detecting their connected components. The graphs can be built from word dictionaries (Levary et al., 2012), association networks (Dubuisson et al., 2013), and word vector representations (Jakubowski et al., 2020). Another direction involves building classifiers upon geometric structural properties for movie genre detection (Doshi and Zadrozny, 2018), textual entailment (Savle et al., 2019), and document classification (Das et al., 2021; Werenski et al., 2022). Recent works have mainly focused on the topology of LMs' internal representations. Kushnareva et al. (2021) represent attention maps with TDA features to approach artificial text detection. Colombo et al. (2021) introduce BARYSCORE, an automatic evaluation metric for text generation that relies on the Wasserstein distance and barycenters. To the best of our knowledge, TDA methods have not yet been applied to LA.

Attention Graph
We treat the Transformer's attention matrix A_attn as a weighted graph G, where the vertices represent tokens and the edges connect pairs of tokens with mutual attention weights. This representation can be used to build a family of attention graphs called a filtration, i.e., an ordered set of graphs G_{τ_i} filtered by increasing attention weight thresholds τ_i. Filtering out the edges with weights lower than a given threshold affects the graph structure and its core features, e.g., the number of edges, connected components, or cycles. TDA techniques allow tracking these changes, identifying the moments when features appear (their "birth") or disappear (their "death"), and associating a lifetime with them. The latter is encoded as a set of intervals called a "barcode", where each interval ("bar") lasts from the feature's "birth" to its "death". The barcode characterizes the persistent features of attention graphs and describes their stability.
Example. Let us illustrate the process of computing the attention graph filtration and barcodes given Example (1).

(1) There is snowing today.
First, we compute attention maps for each Transformer head as shown in Figure 1a-1b (left). These two heads follow different attention patterns (Clark et al., 2019): attention to the next token (Figure 1a) and to the [SEP] token (Figure 1b). Next, we represent each map as a weighted graph and conduct the filtration procedure for a fixed set of attention weight thresholds. The edges with weights lower than each given threshold are discarded, which results in a set of six attention graphs, with their maximum spanning trees (MSTs) becoming a chain (Figure 1c; τ=0.1) and a star (Figure 1d; τ=0.5). The families of attention graphs are used to compute persistent features (§3.2). A cycle appears in the first family of graphs (Figure 1c; τ=0.05), which is shown as a blue bar in Figure 1a. By contrast, there are no cycles in the second family (Figure 1d) and on the corresponding barcode.
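The filtration procedure above can be sketched in a few lines of plain Python. This is a minimal illustration with a made-up 4-token attention map (the values are our assumption, not the paper's data); it tracks how the number of edges and connected components of the undirected attention graph changes as the threshold τ grows.

```python
from itertools import combinations

def undirected_graph(attn, tau):
    """Keep an edge (i, j) if either mutual attention weight reaches tau."""
    n = len(attn)
    return [(i, j) for i, j in combinations(range(n), 2)
            if max(attn[i][j], attn[j][i]) >= tau]

def components(n, edges):
    """Count connected components with a simple union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    return len({find(v) for v in range(n)})

# Toy 4-token attention map (rows roughly sum to 1); values are illustrative.
attn = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.10, 0.70, 0.15],
    [0.05, 0.15, 0.10, 0.70],
    [0.25, 0.25, 0.25, 0.25],
]

for tau in (0.1, 0.5, 0.65):
    edges = undirected_graph(attn, tau)
    print(tau, len(edges), components(4, edges))
```

Raising τ from 0.1 to 0.65 thins the graph from a fully connected one to a two-component one, which is exactly the kind of change ("death" of edges, "birth" of new components) that the barcode records.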

Persistent Features of Attention Graphs
We follow Kushnareva et al. (2021) to design three groups of persistent features of the attention graph: (i) topological features, (ii) features derived from barcodes, and (iii) features based on the distance to attention patterns. The features are computed on attention maps produced by a Transformer LM.
Topological Features. The topological features include the first two Betti numbers β0 and β1 of the undirected graph, and standard properties of the directed graph, such as the number of strongly connected components, edges, and cycles. The features are calculated at pre-defined thresholds over the undirected and directed attention graphs from each head separately and are further concatenated.
Features Derived from Barcodes. A barcode is a representation of the graph's persistent homology (Barannikov, 2021). We use the Ripser++ toolkit (Zhang et al., 2020) to compute 0/1-dimensional barcodes for A_attn. Since Ripser++ operates on distance matrices, we transform A_attn as A' = 1 − max(A_attn, A_attn^T).
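The transform A' = 1 − max(A_attn, A_attn^T) is simple enough to sketch directly; the point is that it symmetrizes the attention map so strongly attending token pairs become close in the resulting pseudo-distance matrix, which a persistence toolkit such as Ripser++ can then consume. The toy values below are assumptions for illustration.

```python
def to_distance(attn):
    """Symmetrize an attention map into a pseudo-distance matrix:
    A' = 1 - max(A, A^T), so mutually attending token pairs end up close."""
    n = len(attn)
    return [[1.0 - max(attn[i][j], attn[j][i]) for j in range(n)]
            for i in range(n)]

attn = [[0.2, 0.7],
        [0.4, 0.6]]
dist = to_distance(attn)
# dist is symmetric: dist[0][1] == dist[1][0] == 1 - max(0.7, 0.4) == 0.3
```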
Next, we compute descriptive characteristics of each barcode, such as the sum/average/variance of the bar lengths and the number of bars with the time of birth/death greater/lower than a threshold.

Features Based on Distance to Patterns. The shape of attention graphs can be divided into several patterns: attention to the previous/current/next token, attention to the [SEP]/[CLS] token, and attention to punctuation marks (Clark et al., 2019). We formalize the attention patterns as binary matrices and calculate distances to them as follows: we take the Frobenius norm of the difference between the matrices, normalized by the sum of their norms. The distances to the patterns are used as a feature vector.
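The distance-to-pattern feature can be sketched as follows. This is a minimal illustration with an assumed binary "attention to next token" pattern for three tokens; the normalization (Frobenius norm of the difference divided by the sum of the norms) follows the description above.

```python
import math

def frobenius(m):
    """Frobenius norm of a matrix given as nested lists."""
    return math.sqrt(sum(x * x for row in m for x in row))

def pattern_distance(attn, pattern):
    """Frobenius norm of the difference, normalized by the sum of the norms."""
    diff = [[a - p for a, p in zip(ra, rp)] for ra, rp in zip(attn, pattern)]
    return frobenius(diff) / (frobenius(attn) + frobenius(pattern))

# Binary "attention to next token" pattern for 3 tokens (assumed formalization).
next_token = [[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]]
attn = [[0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8],
        [0.3, 0.3, 0.4]]
d = pattern_distance(attn, next_token)
```

A map identical to the pattern yields a distance of exactly 0, and the normalization keeps the feature in [0, 1], so distances to different patterns are comparable.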
Notations. We summarize the notations used throughout the paper.

Example. We can identify the "birth" of the RTD feature at α=0.47, when an edge appears in G_b^{α=0.47} between the connected component "trained/murmured" and the connected component with four vertices, namely "[SEP]", "[CLS]", ".", and "Tara" (Figure 2; the appearing edge is colored in green). We observe its "death" when the edge becomes present in both attention graphs at α=0.65 (the corresponding edge changes its color to grey in the graph G_a^{α=0.65} ∩ G_b^{α=0.65}). When comparing the graphs in this manner, we can associate a lifetime to the feature by computing the difference between the moments of its "death" (e.g., α_j=0.65) and "birth" (e.g., α_i=0.47). The lifetimes are illustrated as the orange bars [α_i, α_j] in Figure 2. The resulting value of RTD(G_a, G_b) is the sum of the lifetimes α_j − α_i over all such features. A formal description of RTD is provided in Appendix A.

Data
We use three LA classification benchmarks in English (CoLA; Warstadt et al., 2019), Italian (ItaCoLA; Trotta et al., 2021), and Swedish (DaLAJ; Volodina et al., 2021). CoLA and ItaCoLA contain sentences from linguistic textbooks and cover morphological, syntactic, and semantic phenomena. The target labels are the original authors' acceptability judgments. DaLAJ includes L2-written sentences with morphological violations or incorrect word choices. The benchmark statistics are described in Table 1 (see Appendix C). We provide examples of acceptable and unacceptable sentences in English (3), Italian (4), and Swedish (5) from the original papers; the Swedish pair and the gloss of the ungrammatical Italian sentence are given below.

(4) b. *"This woman have impressed me."

(5) a. Jag kände mig jättekonstig.
"I felt very strange."
b. *Alla blir busiga med sociala medier.
"Everyone is busy with social media."
Baselines. We use fine-tuned LMs and a linear layer trained over the pooler output of frozen LMs as baselines.
Our Models. We train Logistic Regression classifiers over the persistent features computed with each model instance: (i) the average length of bars (H0M); (ii) the concatenation of all topological features, referred to as TDA (§3.2). Following Warstadt et al., we evaluate the performance with the accuracy score (Acc.) and the Matthews Correlation Coefficient (MCC; Matthews, 1975). The fine-tuning details are provided in Appendix B.
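The MCC metric used throughout the classification experiments is straightforward to compute from the confusion matrix. The following is a minimal stdlib sketch (with made-up labels), not the paper's evaluation code; in practice a library implementation such as scikit-learn's `matthews_corrcoef` would typically be used.

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal is empty and the score is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
print(round(mcc(y_true, y_pred), 3))  # 0.333
```

Unlike plain accuracy, MCC stays near 0 for a trivial majority-class classifier on the imbalanced acceptability data, which is why it is the headline metric here.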

Results
Table 1 outlines the LA classification results. Our TDA classifiers generally outperform the baselines by up to 0.14 MCC for English, 0.24 MCC for Italian, and 0.08 MCC for Swedish. The H0M feature alone enhances the performance for English and Italian by up to 0.1 MCC, and the concatenation of all features achieves the best scores. Comparing the results under the frozen and fine-tuned settings, we draw the following conclusions. The TDA features significantly improve the frozen baseline performance but require the LM to be fine-tuned to maximize the performance. However, the TDA/H0M classifiers perform on par with the fine-tuned baselines for Swedish. The results suggest that our features may fail to infer the lexical items and word derivation violations peculiar to the DaLAJ benchmark.
Effect of Freezing Layers. Another finding is that freezing the Transformer layers significantly affects acceptability classification. Most of the frozen baselines score less than 0.1 MCC across all languages. The results align with Lee et al. (2019), who discuss the performance degradation of BERT-based models depending on the number of frozen layers. With all layers frozen, the model performance can fall to zero.

Results by Linguistic Features. We run a diagnostic evaluation of the fine-tuned models using a grammatically annotated version of the CoLA development set (Warstadt and Bowman, 2019). Figure 3 (En-BERT and XLM-R; Figure 1 in Appendix C.1) presents the results of measuring the MCC on sentences exhibiting the major features.
The overall pattern is that the TDA classifiers may outperform the fine-tuned baselines, while the H0M ones perform on par with the latter. The performance is high on sentences with default syntax (Simple) and marked argument structure, including prepositional phrase arguments (Arg. Type) and verb phrases with unusual structures (Arg. Altern). The TDA features capture surface properties, such as the presence of auxiliary or modal verbs (Auxiliary), and structural ones, e.g., embedded complement clauses (Comp Clause) and infinitive constructions (to-VP). The models receive moderate MCC scores on sentences with question-like properties (Question), adjuncts performing semantic functions (Adjunct), negative polarity items, and comparative constructions (Determiner).
Analysis of the Feature Space. The LA classification experiments are conducted in a sparse feature space, where the TDA features can strongly correlate with one another, and their contribution is unclear. We run a complementary experiment to better understand how linguistic features are modeled with topology. We investigate the feature space with dimensionality reduction (principal component analysis, PCA; Pearson, 1901) by interpreting the components' structure and identifying the feature importance to the classifier's predictions using Shapley values (Shapley, 1953), a game-theoretic approach to the attribution problem (Sundararajan and Najmi, 2020). Appendix C.2 describes the experiment on the fine-tuned En-BERT + TDA model using the grammatically annotated CoLA development set.
The results show that (i) features of the higher-layer heads, such as the average vertex degree, the number of connected components, edges, and cycles, and attention to the current token, contribute most to the major linguistic features. (ii) Attention to the [CLS]/next token is important to the Determiner, Arg. Type, Comp Clause, and to-VP properties, while attention to the first token and punctuation marks has the least effect in general.
(iii) The number of nodes influences the classifier behavior, which is in line with Warstadt and Bowman, who discuss the effect of the sentence length on the performance.
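The Shapley-value attribution behind this analysis can be illustrated exactly on a tiny cooperative game. The sketch below is not the paper's SHAP pipeline (which uses a library approximation over PCA components); it computes exact Shapley values by averaging marginal contributions over all orderings, for an assumed toy value function over three "features" where f3 only helps in combination with f1.

```python
from itertools import permutations
from math import factorial

def shapley(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            phi[p] += value(frozenset(coalition)) - before
    for p in phi:
        phi[p] /= factorial(n)
    return phi

# Toy value function (an assumption): f1 and f2 are additive,
# and f3 adds value only when f1 is also in the coalition.
def value(coalition):
    v = 0.0
    if "f1" in coalition:
        v += 2.0
    if "f2" in coalition:
        v += 1.0
    if {"f1", "f3"} <= coalition:
        v += 1.0
    return v

phi = shapley(["f1", "f2", "f3"], value)
# The interaction credit is split evenly: phi = {f1: 2.5, f2: 1.0, f3: 0.5},
# and the values sum to value({f1, f2, f3}) = 4.0 (efficiency property).
```

The efficiency property (attributions sum to the full model output) is what makes the per-component ϕ values in Appendix C.2 comparable across linguistic phenomena.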

Models
We conduct the experiments using two Transformer LMs for English: BERT-base and RoBERTa-base (Liu et al., 2019).
Baselines. We compare our methods with the results on BLiMP for human annotators and nine LMs (Warstadt et al., 2020a; Salazar et al., 2020). The baselines range from statistical N-gram LMs to Transformer LMs.
Our Models. Given a minimal pair as in Example (2), we build attention graphs G_a and G_b from each attention head of a frozen Transformer LM.
We use the H0M feature (§3.2) and RTD (§3.3) as scoring functions to distinguish between the sentences S_a and S_b. The scoring is based on empirically defined decision rules modeled after the forced-choice task: if the scoring function favors G_a over G_b, S_a is judged acceptable; otherwise, S_b is acceptable. We evaluate the scoring performance of each attention head, head ensembles, and all heads w.r.t. each and all linguistic phenomena in BLiMP. The following head configurations are used for each Transformer LM and scoring function:

• Phenomenon Head and Top Head are the best-performing attention heads for each and all phenomena, respectively. The heads undergo selection with a brute-force search and operate as independent scorers.

• Head Ensemble is a group of the best-performing attention heads selected with beam search. The size of the group is always odd. We collect majority vote scores from the attention heads in the group.

• All Heads involves majority vote scoring with all 144 heads. We use random guessing in case of a tie. This setup serves as a proxy for the efficiency of the head selection.
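The majority-vote scoring over a head ensemble can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the per-head scores are hypothetical, and the direction of the per-head comparison (preferring the sentence whose graph yields the larger score) is our assumption.

```python
import random

def vote(score_a, score_b):
    """Single head's forced choice between a minimal pair.
    Assumption: the head prefers the sentence with the larger graph score."""
    return "a" if score_a > score_b else "b"

def ensemble_choice(scores_a, scores_b, seed=0):
    """Majority vote over heads; ties broken randomly (as for All Heads)."""
    votes = [vote(sa, sb) for sa, sb in zip(scores_a, scores_b)]
    n_a = votes.count("a")
    n_b = len(votes) - n_a
    if n_a == n_b:
        return random.Random(seed).choice(["a", "b"])
    return "a" if n_a > n_b else "b"

# Hypothetical per-head H0M scores for the two sentences of a minimal pair;
# an odd-sized ensemble (here, 3 heads) never ties.
scores_a = [0.42, 0.55, 0.61]
scores_b = [0.40, 0.58, 0.47]
print(ensemble_choice(scores_a, scores_b))  # "a" wins 2 votes to 1
```

Keeping the ensemble size odd, as the paper does, is what makes the random tie-break necessary only in the All Heads (144-head) setup.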
Notes on Head Selection. Recall that the head selection procedure imposes the following limitation: auxiliary labeled minimal pairs are required to find the best-performing Phenomenon Heads, Top Heads, and Head Ensembles. However, this procedure is more effective than All Heads, since it maximizes the performance while utilizing only one or 9-to-59 heads. We also analyze the effect of the amount of auxiliary data used for the head selection on the scoring performance (§5.3). Appendix D.1 presents a more detailed description of the head selection procedure.
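The beam-search selection of head ensembles (Algorithm 3 in Appendix D.1) can be sketched in miniature. This is our simplified reconstruction under stated assumptions: `head_votes` holds each head's hypothetical forced choices on auxiliary pairs, ensembles are grown two heads at a time to stay odd-sized, and the search stops when no candidate improves on the current best.

```python
def beam_search_heads(head_votes, labels, beam_width=3, max_size=5):
    """Beam search over odd-sized head ensembles (a sketch, not Algorithm 3).

    head_votes[h][i] is head h's choice ("a"/"b") on auxiliary pair i;
    labels[i] is the acceptable sentence of pair i."""
    def accuracy(ensemble):
        correct = 0
        for i, gold in enumerate(labels):
            n_a = sum(1 for h in ensemble if head_votes[h][i] == "a")
            pred = "a" if n_a * 2 > len(ensemble) else "b"
            correct += pred == gold
        return correct / len(labels)

    heads = list(head_votes)
    beam = sorted(((h,) for h in heads), key=lambda e: -accuracy(e))[:beam_width]
    beam = [(e, accuracy(e)) for e in beam]
    best = beam[0]
    while len(beam[0][0]) + 2 <= max_size:
        candidates = []
        for ens, _ in beam:
            rest = [h for h in heads if h not in ens]
            # Grow by two heads at a time to keep the ensemble size odd.
            for i, h1 in enumerate(rest):
                for h2 in rest[i + 1:]:
                    new = ens + (h1, h2)
                    candidates.append((new, accuracy(new)))
        if not candidates:
            break
        candidates.sort(key=lambda x: -x[1])
        beam = candidates[:beam_width]
        if beam[0][1] > best[1]:
            best = beam[0]
        else:
            break
    return best

# Hypothetical votes of four heads on four auxiliary pairs (gold answer: "a").
head_votes = {
    "h1": ["a", "a", "b", "a"],
    "h2": ["a", "b", "a", "a"],
    "h3": ["b", "a", "a", "a"],
    "h4": ["b", "b", "b", "b"],
}
labels = ["a", "a", "a", "a"]
ensemble, acc = beam_search_heads(head_votes, labels)
# No single head is perfect (0.75 each), but the {h1, h2, h3} majority is.
```

The toy run also shows why ensembles can beat any Top Head: each head errs on a different pair, and the majority vote cancels the individual errors.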

Results
We provide the results of scoring BLiMP pairs in Table 2. The accuracy is the proportion of minimal pairs in which the method prefers the acceptable sentence to the unacceptable one. We report the maximum accuracy scores for our methods across five experiment restarts. The general trends are that the best head configuration performs on par with the human baseline and achieves the highest overall performance (RoBERTa-base + RTD; Head Ensemble). RoBERTa predominantly surpasses BERT and the other baselines, and topological scoring may improve on scores from both BERT and RoBERTa for particular phenomena.

Head Ensemble Results. Table 3 describes the best-performing Head Ensembles by Transformer LM. Most heads selected under the H0M and RTD scoring functions are similar across LMs. While the selected BERT heads are distributed across all layers, the RoBERTa ones tend to be localized at the middle-to-higher layers. Although RoBERTa utilizes smaller ensembles when delivering the best overall score, some heads contribute in both LMs, most notably at the higher layers.
Overall, the RoBERTa H0M/RTD ensembles achieve the best results on Filler gap, Quantifiers, Island effects, NPI, and S-V agr, as shown in Table 2, matching the human level and surpassing four larger LMs on all phenomena by up to 7.4% (GPT2-medium and GPT2/BERT/RoBERTa-large).
Effect of Auxiliary Data. Note that the head selection can be sensitive to the number of additional examples. The analysis of this effect is presented in Appendix D.2. The results show that head ensembles, their size, and their average performance tend to be more stable when using sufficient examples (the more, the better); however, using only one extra example can yield performance above 80%.

Discussion
Topology and Acceptability. The topological properties of the attention graph represent interpretable and versatile features for judging sentence acceptability and identifying acceptability contrasts in minimal pairs. One such property, the sum of bar lengths (H0S), along with its normalized version (H0M), has proved efficient for both LA approaches. This simple feature can serve as a profitable input for LA classifiers and as a scoring function to discriminate between minimal pairs. Figure 4 shows an example of the H0S sensitivity to CoLA's question-like properties, such as wh-movement out of syntactic islands and matrix and embedded questions. We provide more examples in Appendix E, which demonstrate the distribution shifts between the acceptable and unacceptable sentences.
Acceptability Phenomena. The underlying structure of the attention graph encodes various well-established grammatical concepts. We observe that the persistent graph features capture surface properties, morphological agreement, structural relationships, and simple/complex syntactic phenomena well. However, lexical items, optional syntactic elements, and abstract semantic factors may be difficult to infer with topology. Attention to the first token and punctuation marks contributes least to LA classification, while the other attention pattern features capture various phenomena.
Linguistic Roles of Heads. Topological tools help gain empirical evidence about the linguistic roles of heads from another perspective. Our findings on the heads' roles align with several related studies. The results on the CoLA-style and BLiMP benchmarks indicate that (i) a single head can perform multiple linguistic functions (Pande et al., 2021), (ii) some linguistic phenomena, e.g., phrasal movement and island effects, are better captured by head ensembles than by a single head (Htut et al., 2019), and (iii) heads within the same or nearby layers extract similar grammatical phenomena (Bian et al., 2021).

Conclusion and Future Work
Our paper studies the ability of attention heads to judge grammatical acceptability, demonstrating the profitable application of TDA tools to two LA paradigms. Topological features can boost LA classification performance in three typologically close languages. The H0M/RTD scoring matches or outperforms larger Transformer LMs and reaches human-level performance on BLiMP while utilizing only 9-to-59 attention heads. We also interpret the correspondence between the persistent features of the attention graph and grammatical concepts, revealing that the former efficiently infer morphological, structural, and syntactic phenomena but may lack lexical and semantic information.
In our future work, we hope to assess the linguistic competence of Transformer LMs on related resources for typologically diverse languages and analyze which language-specific phenomena are and are not captured by the topological features. We are also planning to examine novel features, e.g., the number of vertex covers, the graph clique-width, and the features of path homology (Grigor'yan et al., 2020). Another direction is to evaluate the benefits and limitations of the H0M/RTD features as scoring functions in downstream applications.
We also plan to introduce support for new deep learning frameworks such as MindSpore (Tong et al., 2021) to bring TDA-based experimentation to the wider industrial community.

Limitations
Computational Complexity

Acceptability Classification. The calculation of any topological feature relies on the Transformer's attention matrices. Hence, the computational complexity of our features is not lower than that of producing an attention matrix with one head, which is asymptotically O(n^2 d + nd^2), given that n is the maximum number of tokens and d is the token embedding dimension (Vaswani et al., 2017).
The pattern-based and threshold-based features are computed in linear time O(e + n), where e is the number of edges in the attention graph. In turn, the number of edges is not higher than n(n−1)/2 ∼ n^2. The computation of the 0th Betti number β0 takes linear time O(e + n), as β0 equals the number of connected components in an undirected graph. The computation of the 1st Betti number β1 takes constant time, since β1 = e − n + β0. The computational complexity of the number of simple cycles and the 1-dimensional barcode features is exponential in the worst case. To reduce the computational burden, we stop searching for simple cycles after a pre-defined number of them is found.
Note that the computational costs could be reduced, e.g., by identifying the most contributing features or the best-performing heads. Consider the example in Figure 5, which illustrates how the CoLA performance changes depending on the number of En-BERT heads. Here, the head selection is based on a simple procedure. First, we score the attention heads by calculating the maximum correlation between each head's features and the vector of the target classes on the train set. Second, we train a linear classifier over the TDA features produced by the N attention heads ranked by the correlation values, as specified in §4.2. Satisfactory MCC scores can be achieved when utilizing fewer than 40 heads, with a significant speed-up at the inference stage.

Linguistic Minimal Pairs. The computation of the H0M and RTD features is run via the Ripser++ GPU library. Under this library, the minimum spanning tree is found with Kruskal's algorithm, giving H0M a computational complexity of O(n^2 log n). The complexity can be reduced using other algorithms, e.g., Prim's algorithm, which takes O(n^2). The RTD computational complexity is more difficult to estimate. RTD is computed via persistence barcodes of dimension 1 for a specific graph with 2n vertices. Many optimization techniques and heuristics implemented in the Ripser++ library significantly reduce the RTD complexity.

Empirical estimate.
Computing the H0M/RTD features with 144 BERT heads in the worst case of a 512-token text takes 2.41 and 94.5 sec, respectively (NVIDIA Tesla K80, 12GB RAM). However, the actual computation time on the considered tasks is considerably lower in practice. We provide empirical estimates on the entire BLiMP and LA datasets: 2.4/15.7 hours on BLiMP (H0M/RTD) and up to 2 hours on CoLA/ItaCoLA/DaLAJ (estimates by feature group: topological features = 24%, features derived from barcodes = 70%, and features based on distance to patterns = 6% of the total time).

Application Limitations
We also outline several application limitations of our approach. (i) The LA classifiers require preliminary fine-tuning of Transformer LMs to extract more representative attention graph features and, therefore, achieve better performance. (ii) RTD operates upon a one-to-one vertex correspondence, which may be hindered by tokens segmented into an unequal number of sub-tokens. As a result, identifying the topological discrepancy between pairs of attention graphs can be restricted in practice, where the graphs have an arbitrary number of nodes. Regardless of the potential information loss due to sentence truncation in such cases, the RTD heads still receive the best overall score on BLiMP. (iii) The head selection procedure relies on auxiliary data to identify the best-performing head configurations. Annotating the auxiliary data may require additional resources and expertise for practical purposes. However, the procedure maximizes the performance and reduces the computational costs by utilizing fewer attention heads.

Linguistic Acceptability
Acceptability judgments have been broadly used to investigate whether LMs learn grammatical concepts central to human linguistic competence. However, this approach has several methodological limitations. (i) The judgments may display low reproducibility in multiple languages (Linzen and Oseki, 2018), and (ii) they may be influenced by an individual's exposure to ungrammatical language use (Dąbrowska, 2010). (iii) Distribution shifts between LMs' pre-training corpora and LA datasets may introduce bias into the evaluation, since LMs tend to assign higher probabilities to frequent patterns and treat them as acceptable in contrast to rare ones (Marvin and Linzen, 2018; Linzen and Baroni, 2021).

Ethical Statement
Advancing acceptability evaluation methods can improve the quality of natural language generation (Batra et al., 2021). We recognize that this, in turn, can increase the misuse potential of such models, e.g., generating fake product reviews, social media posts, and other targeted manipulation (Jawahar et al., 2020; Weidinger et al., 2021). However, the acceptability classifiers and scoring functions laid out in this paper are developed for research purposes only. Recall that the topological tools can be employed to develop adversarial defense and artificial text detection models for mitigating the risks (Kushnareva et al., 2021).

A Representation Topology Divergence
Suppose we have two weighted full graphs G_a, G_b with one-to-one vertex correspondence. Define their vertices as {a_1, a_2, ..., a_n} and {b_1, b_2, ..., b_n}, respectively, so that a_i corresponds to b_i for each i. RTD(G_a, G_b) is calculated as follows:

1. Build a full weighted graph G_ab with the vertex set V = {v_1, v_2, ..., v_n, u_1, u_2, ..., u_n} and the edge weights computed from w_a and w_b, the edge weights in the corresponding graphs; in particular, each v_i is connected to its counterpart u_i with zero weight.

2. Compute the barcode (Barannikov, 2021) of the H_1 homology group of the flag complex of G_ab. It should be emphasized that the H_0 homology group barcode for this graph is empty, since the minimum spanning tree of G_ab has a total weight of 0. Instead of H_1, the higher-order homology groups (e.g., H_2, H_3) can be considered; however, preliminary experiments have shown that they are less helpful for LA tasks.

3. RTD(G_a, G_b) is calculated as the sum of bar lengths in the barcode from the previous step.
It should be noted that this procedure is asymmetric in G_a and G_b: RTD(G_a, G_b) ≠ RTD(G_b, G_a) holds for non-equal graphs. To compute barcodes, we use the Ripser++ toolkit, which cannot work with asymmetric graphs. Hence, we represent the asymmetric attention maps as distance matrices to obtain the symmetric graphs G_a and G_b, as described in §3.2. We consider only the forward-looking part of attention, i.e., how each token affects the rest of the sentence.
The majority of the BLiMP minimal pairs are of equal length in BERT/RoBERTa tokens. Otherwise, we truncate the longer sentence to achieve an equal length, since the one-to-one correspondence between tokens is crucial for RTD. We assume that the truncation may remove tokens that help to discriminate between the acceptable and unacceptable sentences. We leave improvement of the pre-processing stage for future work.

B Fine-tuning Details
Fine-tuning and evaluation of the BERT-based/XLM-R acceptability classifiers follow the standard procedure under the HuggingFace library (Wolf et al., 2020). Each model is fine-tuned for 4 epochs with a learning rate of 1e-2/1e-3, a batch size of 32, and the other default hyperparameters.

C.2 Analysis of the Feature Space
We analyze the contribution of the topological features to acceptability classification in the context of linguistic phenomena. We interpret the principal components computed on the fine-tuned En-BERT + TDA features and identify their importance w.r.t. the classifier's predictions. We also explore masking principal components, i.e., training the classifier using only the most important components while zeroing the weights of the others.
Table 2 shows results for the full pipeline (En-BERT + TDA + PCA) and the masked pipelines (En-BERT + TDA + PC 1/PC 2). Since the performance is comparable to the En-BERT + TDA classifier in §4, we rely on the PCA decomposition for the feature analysis and interpretation.

Results. The following six principal components (PCs) contribute most to acceptability classification according to the mean absolute Shapley values ϕ (see Figure 2). Figure 3 shows the Shapley values for these PCs by the major linguistic feature.

PC1 (ϕ=3.179) has the most impact on the classifiers' performance. It primarily contains simple topological features (the average vertex degree, the number of edges, and the number of connected components) from the heads at the last layer, which is affected most by the fine-tuning. PC7 (ϕ=0.601) includes the same heads as PC1, but its features utilize the number of cycles in the attention graph. PC9 (ϕ=0.442) groups all attention patterns except attention to commas for heads at the lower and middle layers; this component contributes to all phenomena.

The following four PCs are less important for acceptability classification (ϕ < 0.25) in general but may contribute to some linguistic phenomena. PC16 (ϕ=0.243) comprises topological and distance-to-pattern features of different heads at the middle layers; it contributes to negative polarity and free choice items, non-finite complementizer phrases, and comparative constructions. PC20 (ϕ=0.216) reflects attention-to-comma for various heads at the lower layers; nevertheless, it helps to classify sentences that fall under the S-Syntax and Question categories. PC15 (ϕ=0.216) includes the attention-to-first-token pattern for the middle-to-higher-layer heads (generally 4-to-10); it works for Passive and By-Phrases. PC2 (ϕ=0.203) reflects the number of graph edges for heads at the first layer, which captures strong pair-wise information about tokens; this component is important for sentences with default syntax (Simple).
PC3 and PC6 (ϕ < 0.05) represent the attention-to-dot pattern. These PCs are not important for any of the linguistic phenomena, despite having large eigenvalues.

D Attention Head Selection

D.1 Head Selection Procedure
We use publicly available scripts 5 to generate up to 100 minimal pairs for each of the 67 types, ensuring no overlap with the BLiMP pairs. We select the best-performing individual heads and head ensembles by estimating their scoring performance on the generated data and then evaluate them on BLiMP. Algorithms 1-2 describe the Top Head and Phenomenon Head selection procedures using a brute-force search, while Algorithm 3 presents the process of selecting the Head Ensembles via beam search.
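As a minimal sketch, the beam-search selection of Algorithm 3 can be outlined as follows. The candidate set size, beam width, ensemble size limit, and the placeholder acc(·) function below are hypothetical stand-ins for the paper's majority-voting accuracy on generated minimal pairs.

```python
# Generic beam-search sketch of Head Ensemble selection (cf. Algorithm 3).
# Q1, beam_width, max_size, and acc() are hypothetical stand-ins.

Q1 = [(head, rule) for head in range(144) for rule in (1, 2)]  # all (h, r) pairs

def acc(ensemble):
    # Placeholder score in [0, 1); in the paper this is the majority-voting
    # accuracy of the ensemble on the generated minimal pairs.
    return sum((h * 31 + r) % 97 for h, r in ensemble) / (97 * len(ensemble))

def select_head_ensemble(Q1, beam_width=5, max_size=4):
    # Start from the best-scoring single (head, rule) pairs.
    beam = sorted([((q,), acc((q,))) for q in Q1], key=lambda x: -x[1])[:beam_width]
    best = beam[0]
    for _ in range(max_size - 1):
        # Extend every ensemble on the beam with every unused pair,
        # then keep the top beam_width extensions.
        candidates = [
            (ens + (q,), acc(ens + (q,)))
            for ens, _ in beam
            for q in Q1 if q not in ens
        ]
        beam = sorted(candidates, key=lambda x: -x[1])[:beam_width]
        if beam[0][1] > best[1]:
            best = beam[0]
    return best  # (ensemble, score)

ensemble, score = select_head_ensemble(Q1)
print(len(ensemble), round(score, 3))
```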

D.2 Effect of Auxiliary Data
We analyze the effect of the amount of auxiliary generated data on the RTD scoring performance. We explore N ∈ [1, 5, 10, ..., 100] sentence pairs per phenomenon; the results are shown in Figure 4.
E The H 0 S Feature Distributions

F Toy Examples of Calculating Features
Let us demonstrate the calculation of essential barcode features on a toy graph G_toy (Figure 7a). First, we calculate the H0-barcode of this graph. To do so, we build the graph G'_toy by replacing each edge weight w with 1 − w, as in Figure 7b. Next, we calculate the minimum spanning tree of this new graph (Figure 7c). We end up with the H0-barcode, whose bar lengths equal the edge weights of the minimum spanning tree (Figure 7d). From this barcode diagram we can derive H0S(G_toy) = 0.3 + 0.4 + 0.5 = 1.2 and H0M(G_toy) = H0S(G_toy)/3 = 0.4.
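As a sanity check, this H0S/H0M computation can be reproduced with a minimum spanning tree routine. The edge weights below are hypothetical, chosen only so that the inverted-weight MST yields the bar lengths 0.3, 0.4, and 0.5, matching this example.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Hypothetical attention weights for a 4-vertex toy graph (upper triangle only);
# values chosen so the inverted-weight MST has bars 0.3, 0.4, 0.5.
W = np.array([
    [0.0, 0.7, 0.2, 0.1],
    [0.0, 0.0, 0.6, 0.3],
    [0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
])

# Build G'_toy by replacing each edge weight w with 1 - w.
G_prime = np.where(W > 0, 1.0 - W, 0.0)

# The H0 bar lengths equal the edge weights of the minimum spanning tree.
bars = np.sort(minimum_spanning_tree(G_prime).toarray().ravel())
bars = bars[bars > 0]

h0_sum = bars.sum()            # H0S(G_toy)
h0_mean = h0_sum / len(bars)   # H0M(G_toy) = H0S / (|V| - 1)
print(bars, h0_sum, h0_mean)   # bars ~ [0.3 0.4 0.5], H0S ~ 1.2, H0M ~ 0.4
```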
Note that the directions of the bars (Figure 1) are reversed compared to the actual ripser++ output (Figure 7d). The reversed representation is more intuitive: edges with lower weights are filtered out earlier than edges with higher weights.
Next, we compute the Betti numbers for the same graph G_toy given three thresholds: τ1 = 0, τ2 = 0.4, and τ3 = 1. At τ = 0, we do not drop any edges and have the full graph with one connected component (Figure 7a). β0 is defined as the number of connected components, hence β0 equals 1. Next, we calculate β1 using the shortcut formula for graphs: β1 = |E| + |C| − |V|. In our case, |E| = 6 is the number of edges, |C| = 1 is the number of connected components, and |V| = 4 is the number of vertices. Finally, we get β1 = 6 + 1 − 4 = 3. Note that β1 corresponds to the three simple undirected loops in the graph. There is also an alternative method of representing the graph and calculating the first Betti number, which does not count the "trivial" loops defined by triangle borders; it was used in the example above, see Figure 1.
At τ = 0.4, we drop all edges with weights lower than 0.4. We get the same structure as the minimum spanning tree of the graph G'_toy (Figure 7c), but without weight inversion. For this graph, β0 = 1, since there is a single connected component, and β1 = 3 + 1 − 4 = 0, which corresponds to the number of simple loops (zero).
At τ = 1, we drop all edges, as all edge weights are below 1. The resulting graph consists only of vertices without edges. In this case, we have four connected components, so β0 = 4 and β1 = 0 + 4 − 4 = 0.
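The threshold sweep above can be sketched with the shortcut formula β1 = |E| + |C| − |V|. The toy weights below are hypothetical, chosen to reproduce the counts in this example: the full graph at τ = 0, a three-edge tree at τ = 0.4, and isolated vertices at τ = 1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical 4-vertex toy weights (upper triangle of the complete graph K4).
W = np.array([
    [0.0, 0.7, 0.2, 0.1],
    [0.0, 0.0, 0.6, 0.3],
    [0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
])

def betti_numbers(W, tau):
    """Return (beta_0, beta_1) after dropping edges with weight below tau."""
    keep = (W > 0) & (W >= tau)
    n_edges = int(keep.sum())                         # |E|
    n_comp, _ = connected_components(csr_matrix(keep.astype(float)),
                                     directed=False)  # |C|
    beta0 = n_comp
    beta1 = n_edges + n_comp - W.shape[0]             # beta_1 = |E| + |C| - |V|
    return beta0, beta1

for tau in (0.0, 0.4, 1.0):
    print(tau, betti_numbers(W, tau))
# tau=0.0 -> (1, 3); tau=0.4 -> (1, 0); tau=1.0 -> (4, 0)
```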

Figure 1a-1b (right) depict barcodes for each family of graphs. The bars are sorted by length. The number of bars equals |T| − 1, where |T| is the number of tokens in the input sentence. The yellow bars correspond to the 0-dimensional features acquired from the edges of the MST. The blue bars refer to 1-dimensional features, which stand for non-trivial simple cycles. Such a cycle appears in the first family (Figure 1c; τ=0.05), shown as a blue bar in Figure 1a. By contrast, there are no cycles in the second family (Figure 1d) or on the corresponding barcode.

Figure 2: A graphical representation of RTD-barcodes. The top row shows the A' matrices derived from attention maps for acceptable and unacceptable sentences. Edges present in both graphs G^αi_a and G^αi_b at a given threshold αi are colored grey. Edges present only in graph G^αi_b are colored green.
had trained Tara.
b. *Cheryl had murmured Tara.

First, we compute attention maps for the input sentences S_a and S_b with a Transformer LM and represent them as the weighted graphs G_a and G_b. Next, we establish a one-to-one match between the vertices and sort the filtrations G^αi_a and G^αi_b, with α = 1 − τ, in ascending order. We then track the hierarchical formation of connected components in the graph G^αi_a ∩ G^αi_b while increasing αi. The RTD(G_a, G_b) feature appears at threshold αi if an edge with weight αi in the graph G_b joins two different connected components of the graph G^αi_a ∩ G^αi_b. This feature disappears at threshold αj if the two connected components of G^αi_a ∩ G^αi_b become joined in the graph G^αj_a.

Figure 3: Performance (MCC) of the fine-tuned En-BERT and XLM-R by major linguistic feature. Average MCC scores are shown with dashed lines. The number of sentences including the feature is given in square brackets.

5.1 Data

BLiMP (Benchmark of Linguistic Minimal Pairs; Warstadt et al., 2020a) evaluates the sensitivity of LMs to acceptability contrasts in terms of a forced choice between minimal pairs, as in Example (6). The benchmark consists of 67 pair types, each including 1k pairs, covering 12 language phenomena in morphology, syntax, and semantics.

(6) a. Whose hat should Tonya wear?
b. *Whose should Tonya wear hat?

Figure 4: The distribution shift of the H0S feature between the acceptable and unacceptable sentences (Question); [L: 10; H: 0].

Figure 5: Performance on the CoLA development set depending on the number of heads for En-BERT + TDA.

Figure 1: Performance (MCC) of the fine-tuned XLM-R by major linguistic feature. Average MCC scores are shown with dashed lines. The number of sentences including the feature is given in square brackets.

Method.
The pipeline combines feature standardization, PCA, and training a logistic regression classifier. We conduct a grid search over two pipeline parameters: (i) the number of components N_comp ∈ [10, 20, ..., 100] (found optimum: N_comp = 100) and (ii) the L1 regularization parameter of the logistic regression ∈ [0.01, 0.02, ..., 0.1] (found optimum: 0.1). The parameter search is run across 3 stratified folds, where the CoLA train set is randomly split into train/development sets. The classifier performance is evaluated on the grammatically annotated CoLA development set.

Figure 2: Importance of the PCs for judging sentence acceptability. Shapley values ϕ reflect the PCs' impact on the classifier output.

Figure 3: Concatenated mean absolute Shapley values for the important PCs by major linguistic feature.
Algorithm 3 Head Ensemble Selection
Input: Set Q1: contains all possible pairs (h, r), where h is an attention head and r ∈ {1, 2} is a scoring rule
Require: acc(·): accuracy evaluation function of the ensembles with the selected scoring rules using majority voting
Output: Ensemble B: set of pairs (H, R)
procedure SELECTING HEAD ENSEMBLE(Q1)

Figure 4: The effect of a given amount of examples on the BLiMP performance of selected Head Ensembles by major category. Method = RTD scoring. N = number of extra examples per phenomenon.

Figure 5 and Figure 6 illustrate examples of the H0S feature distribution shifts between the acceptable and unacceptable sentences from the entire CoLA development set.

Figure 6: The distribution shift of the H0S feature between the acceptable and unacceptable sentences (Binding); [L: 11; H: 4].

Figure 7: An example of a weighted graph and the corresponding H0-barcode, calculated with the Ripser++ library.
Table 1: Acceptability classification results by benchmark (columns report Acc. and MCC for the IDD and OODD sets, and for the dev and test sets). IDD = "in domain dev" set (CoLA). OODD = "out of domain dev" set (CoLA). Dev; Test = dev and test sets in ItaCoLA and DaLAJ. The best score is in bold, the second best is underlined.

Table 2: Percentage accuracy of the baseline models (including BERT-base and RoBERTa-base), human baseline, and our methods on BLiMP (columns: Overall, Ana. agr, Arg. str, Binding, Ctrl./rais., D-N agr, Ellipsis, Filler, Irreg., Island, NPI, Quant., S-V agr). Overall is the average across all phenomena. The best score is in bold, the second best is underlined.

Table 3: Results of selecting the best-performing Head Ensembles with H0M/RTD-based scoring. H0M heads are colored green; RTD heads are colored yellow.

Table 1: Statistics of acceptability classification benchmarks. Type = type of data source. % = percentage of acceptable sentences. Morph = morphology.
5 github/alexwarstadt/data_generation

Algorithm 1 Top Head Selection
Input: Set Q1: contains all possible pairs (h, r), where h is an attention head and r ∈ {1, 2} is a scoring rule
Require: acc(·): accuracy evaluation function of the head with the selected rule on the pairs for all phenomena
Output: Pair (H_B, R_B)
procedure SELECTING TOP HEAD(Q1)

Algorithm 2 Phenomenon Head Selection
Input: Set Q1: contains all possible pairs (h, r), where h is an attention head and r ∈ {1, 2} is a scoring rule
Require: C: linguistic category
Require: acc_C(·): accuracy evaluation function of the head with the selected scoring rule on the C pairs
Output: Pair (H_C, R_C)
procedure SELECTING PHENOMENON HEAD(Q1)