Evaluating Transformer’s Ability to Learn Mildly Context-Sensitive Languages

Although Transformers perform well in NLP tasks, recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. These findings motivated us to think about their implications for modeling natural language, which is hypothesized to be mildly context-sensitive. We test the Transformer's ability to learn mildly context-sensitive languages of varying complexity, and find that the models generalize well to unseen in-distribution data, but their ability to extrapolate to longer strings is worse than that of LSTMs. Our analyses show that the learned self-attention patterns and representations modeled dependency relations and demonstrated counting behavior, which may have helped the models solve the languages.


Introduction
Transformers (Vaswani et al., 2017) have demonstrated versatile language processing capabilities and enabled a wide range of exciting NLP applications ever since their inception. However, Hahn (2020) shows that hard self-attention Transformers, as well as soft-attention ones under some assumptions, eventually fail at modeling regular languages with periodicity, as well as hierarchical context-free languages, when presented with long enough sequences.
These theoretical limitations have since sparked the interest of the formal language community. A variety of formal languages, as well as formal models of computation such as circuits, counter automata, and predicate logic, have been studied to characterize the expressiveness of the architecture.
When it comes to probing an architecture's linguistic adequacy, a particular class of formal languages and formalisms naturally comes into sight: the mildly context-sensitive class (Joshi, 1985; Kallmeyer, 2010), the formal complexity class hypothesized to have the necessary expressive power for natural language.

Figure 1: How certain MCSGs fit on the Chomsky hierarchy of languages in terms of their weak generative capacities (Stabler, 2011): CFL ⊂ L(TAG) = L(CCG) ⊂ L(MG) = L(LCFRS) = L(MCFG) ⊂ CSL. No grammar generates the largest set of MCSLs.
This motivates us to study the Transformer's ability to learn a variety of linguistically significant, mildly context-sensitive string languages of varying degrees of complexity. Specifically, we ask two research questions: 1. How well do Transformers learn MCSLs from finite examples, both in terms of generalizing to in-distribution data and extrapolating to strings longer than the ones seen during training?
2. What kind of meaningful representations or patterns do the models learn?
Mildly Context-Sensitive Hypothesis

In search of computational models that have adequate power to generate natural language sentences while also assigning meaningful structural descriptions like trees to them, Chomsky (1956, 1959) defined context-sensitive grammar (CSG) and context-free grammar (CFG) as intermediate systems that lie between two extremities: the Turing machine, which overgenerates, and the finite-state automaton, which undergenerates. A question that immediately follows the definitions is whether the CFG could serve as a computational model for natural language. This remained an open question for a few decades until it was settled by evidence such as Swiss German cross-serial dependency (Figure 2; Shieber, 1985) and Bambara vocabulary (Culy, 1985), which demonstrated the existence of natural languages that are supra-context-free.
However, although more restricted than the Turing machine, the CSG is also undesirable, as it still has much more generative capacity than natural language should ever need and, as a result, is hard to parse efficiently. This motivated Tree-Adjoining Grammar (TAG; Joshi, 1985) and Combinatory Categorial Grammar (CCG; Steedman, 1987), among a few other weakly equivalent formalisms, that extend CFG with just enough additional descriptive power that phenomena like Swiss German cross-serial dependency can be treated, while stopping short of the full CSG so that parsing can remain efficient. The properties of formalisms with such additional power roughly characterize a class of languages and grammars that Joshi (1985) calls mildly context-sensitive languages/grammars (MCSL/MCSG).
Another related line of weakly equivalent formalisms, such as Linear Context-Free Rewriting Systems (LCFRS; Vijay-Shanker et al., 1987), Multiple Context-Free Grammar (MCFG; Seki et al., 1991), and Minimalist Grammar (MG; Stabler, 1997), further extends expressive power beyond that of TAG, motivated by more complex phenomena like German scrambling (Becker et al., 1991). While no single grammar generates the largest possible set of MCSLs that satisfy the formal characterization in Kallmeyer (2010), these formalisms are the closest approximations we have. The differences in expressiveness form a subhierarchy within this class (Figure 1), and the languages recognizable by TAG and MG and their respective equivalents, denoted L(TAG) and L(MG), are therefore the language complexity classes we examine in this work.
Therefore, one hypothesis for the complexity of natural language is that it is mildly context-sensitive (henceforth MCS). There have been some challenges citing linguistic data that require power beyond MCS (Radzinski, 1991; Michaelis and Kracht, 1997; Kobele, 2006), but the validity of these claims remains controversial (Bhatt and Joshi, 2004; Clark and Yoshinaka, 2012), or no consensus has been reached on the need for more power (Graf, 2021). Thus, while acknowledging that the truth of the MCS hypothesis remains an open question, it is a reasonably good hypothesis that allows us to analyze natural languages meaningfully.

Related Work
Regarding the expressiveness of the Transformer, Pérez et al. (2019, 2021) established the Turing-completeness of the hard-attention Transformer. Bhattamishra et al. (2020b) prove the Turing-completeness of soft attention by showing that such Transformers can simulate RNNs.
However, these results assumed arbitrary precision for weights and activations, departed in places from the original architecture, and made the proofs through their own unique task definitions. In the practical use case of language learning from finite examples, the Transformer's ability is known to be limited. Merrill (2019) showed that the class of regular languages is not a subset of Transformer-recognizable languages. Specifically, Transformers with different self-attention variants have limited abilities to learn certain star-free languages (Bhattamishra et al., 2020a), as well as non-star-free, periodic regular languages (Hahn, 2020; Bhattamishra et al., 2020a). On the other hand, Chiang and Cholak (2022) discuss several constructions to overcome the limitations in learning periodic regular languages.
As for languages with hierarchical structural analyses, Ebrahimi et al. (2020) empirically demonstrated one setup in which such languages can be learned, and observed stack-like behavior in self-attention patterns. Yao et al. (2021) showed by construction that self-attention can learn hierarchical languages of bounded depth, although the boundedness reduces the CFL to a regular language. Additional work includes Bernardy et al. (2021) and Wen et al. (2023), among others.
Besides language recognition guided by the Chomsky hierarchy, another line of research investigates alternative formal systems, such as counter-recognizable languages (Bhattamishra et al., 2020a) and first-order logic (Merrill and Sabharwal, 2022; Chiang et al., 2023), to characterize the expressiveness of Transformers.
This work introduces MCSGs into the discussion by assessing Transformers on a variety of string languages recognizable by MCSGs of varying weak generative capacities, a class that has not yet been studied as a whole in the way some other complexity classes have. Occasionally, studies of the Transformer's abilities have worked with data that conveniently fall into this class, including a few counter languages that are also TAG-recognizable (Bhattamishra et al., 2020a), discontinuities in Dutch (Kogkalidis and Wijnholds, 2022), reduplication (Deletang et al., 2023), and a crossed-parentheses language inspired by crossing dependency (Papadimitriou and Jurafsky, 2023). This work complements these and the other aforementioned related work by presenting a systematic evaluation guided by basic MCSL constructions and the subhierarchy within the class, as well as by comparing each of the basic constructions against a less complex and a more complex counterpart.

Task Setups
Our experiments use the original soft-attention Transformer with sinusoidal positional encoding (henceforth PE) as defined in Vaswani et al. (2017), and we focus on an encoder-only model, similar to how the architecture is used in BERT (Devlin et al., 2019). We use no dropout, as we are already working with simple abstract formal languages and dropout would harm performance. For each experiment, we also train an LSTM (Hochreiter and Schmidhuber, 1997) baseline for comparison. The implementations of both the Transformer and the LSTM are from PyTorch (Paszke et al., 2019).
We use one of the following two established tasks for each language, depending on which better enables learning for the data. We elaborate on the reasoning behind the choice of task in Appendix A.
Binary Classification Task Following Weiss et al. (2018), a model g is said to recognize a formal language L ⊆ Σ* if f(g(w)) = 1 for all and only strings w ∈ L. In our case, g is the Transformer encoder, and g(w) is the representation of a positive or negative example w, obtained by averaging the symbol embeddings of all symbols in w. f is a fully connected linear layer that maps the pooled representation to a real number, which is then passed through the sigmoid function and thresholded at 0.5 to obtain the class label 0 or 1. The loss is the BCE loss between the prediction and the target label.
Next Character Prediction Task For languages in which training with positive and negative examples is ineffective, because too few positive examples are available or the set of possible negative examples is too large, we use this task, which only requires positive examples. Given a valid prefix of a string in the language at each timestep, the model is tasked to predict the set of acceptable next symbols, or to predict [EOS] if the prefix is already a string in the language. This is a multi-label classification task in which the output is a k-hot vector of dimension |Σ ∪ {[EOS]}|. The symbol embeddings are passed through a linear layer and the sigmoid function in parallel to obtain a k-hot vector at each timestep. The loss is the per-symbol BCE loss between the predicted and target k-hot vectors, summed and then averaged. A look-ahead mask is applied to prevent self-attention from attending to later positions, which indirectly offers positional information (Irie et al., 2019; Bhattamishra et al., 2020a; Haviv et al., 2022).
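As a concrete illustration, the k-hot targets can be constructed as in the following sketch for the simple counter language a^n b^n (the helper names and vocabulary ordering are our own; the actual training pipeline uses the same idea over each language's alphabet):

```python
# Sketch: k-hot next-symbol targets for a^n b^n under the next
# character prediction task. The vocabulary is the alphabet plus [EOS].
VOCAB = ["a", "b", "[EOS]"]

def khot(valid):
    """k-hot vector over VOCAB marking every acceptable next symbol."""
    return [1 if v in valid else 0 for v in VOCAB]

def targets_anbn(n):
    """One target vector per prefix of the input string a^n b^n."""
    targets = []
    # While reading the a-block, n is still undetermined: a or b may follow.
    for _ in range(n):
        targets.append(khot({"a", "b"}))
    # Once the first b appears, n is fixed and the rest is deterministic.
    for i in range(n):
        targets.append(khot({"b"}) if i < n - 1 else khot({"[EOS]"}))
    return targets
```

Each target marks every symbol that could legally come next, so the model is never penalized for a continuation that is consistent with some string in the language.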

Data
Following the categorization in Ilie (1997), we are interested in three basic constructions that should be contained in MCSLs: (1) copying: ww; (2) crossing dependency: a^n b^m c^n d^m; (3) multiple agreements: a^n b^n c^n. All these languages are TAG-recognizable. We compare each of the three languages with a similar but less complex context-free language, as well as a similar but more complex MG-recognizable language. We also investigate two related scramble languages that are felicitously MCFG-recognizable.
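To make the three constructions concrete, the following sketch gives reference recognizers for each (this is our own illustrative code, not the paper's data pipeline; boundary conditions such as requiring n, m ≥ 1 are our assumptions):

```python
import re

def is_copy(s):
    """ww over {a, b}: even length and the two halves match."""
    h = len(s) // 2
    return len(s) % 2 == 0 and s[:h] == s[h:]

def is_crossing(s):
    """a^n b^m c^n d^m with n, m >= 1 (some definitions also admit 0)."""
    match = re.fullmatch(r"(a+)(b+)(c+)(d+)", s)
    return bool(match) and len(match.group(1)) == len(match.group(3)) \
        and len(match.group(2)) == len(match.group(4))

def is_multi_agreement(s, alphabet="abc"):
    """sigma_1^n ... sigma_k^n: equal-length blocks in alphabet order."""
    pattern = "".join(f"({c}+)" for c in alphabet)
    match = re.fullmatch(pattern, s)
    return bool(match) and len({len(g) for g in match.groups()}) == 1
```

These recognizers are linear-time because the hard part of the languages is counting and matching, not search; the question the paper asks is whether a Transformer can induce the same behavior from finite examples.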

Copying
The copy language {ww | w ∈ {a, b}*} is in L(TAG) (Joshi, 1985). Its context-free counterpart is the palindrome language {ww^R | w ∈ {a, b}*}, where w^R is the reverse of the string w. Joshi (1985) indicates that the double copying language www is not in L(TAG). However, any multiple copying w^k is in L(MG) (Jäger and Rogers, 2012). Thus, we study www as the simplest strictly MG-recognizable language for copying.
We use the binary classification setup for this family of languages. To generate the strings, we enumerate every possible w in our chosen |w| range and then duplicate w to produce ww^R, ww, and www. Negative examples are random strings sampled from {a, b}* in the same length range as the positive examples, excluding the positive examples themselves.
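A minimal sketch of this generation procedure might look as follows (function names and the rejection-sampling strategy are our own; the paper's exact sampler may differ):

```python
import itertools
import random

def positives(lengths, variant="ww"):
    """Enumerate positive examples for the copy family over {a, b}.
    variant: 'ww' (copy), 'wwR' (palindrome), or 'www' (double copy)."""
    out = set()
    for k in lengths:
        for w in map("".join, itertools.product("ab", repeat=k)):
            out.add({"ww": w + w,
                     "wwR": w + w[::-1],
                     "www": w + w + w}[variant])
    return out

def negatives(pos, n_samples, seed=0):
    """Random strings over {a, b} in the same length range that are
    not positive examples (rejection sampling)."""
    rng = random.Random(seed)
    lens = sorted({len(s) for s in pos})
    neg = set()
    while len(neg) < n_samples:
        s = "".join(rng.choice("ab") for _ in range(rng.choice(lens)))
        if s not in pos:
            neg.add(s)
    return neg
```

Note that rejection sampling only terminates when the length range contains enough non-positive strings, which holds here because positives are exponentially sparse in the string length.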

Crossing Dependency
The cross-serial dependency language a^n b^m c^n d^m is in L(TAG) (Joshi, 1985). Its context-free counterpart is the nesting dependency language a^n b^m c^m d^n. We use the next character prediction task for these two languages because the potential set of negative examples is too large. Following Gers and Schmidhuber (2001), to recognize the nested a^n b^m c^m d^n: while the input is a, the next valid character is a or b. As soon as the input becomes b, the value of n is determined, and the next valid symbol is b or c. Once the input becomes c, the value of m is also determined, and the next characters are deterministic from this point on. Lastly, the model outputs [EOS] as soon as the final symbol in the input is consumed. We express this as an input-target scheme in the notation of Suzgun et al. (2019), where ⊣ denotes [EOS]; the scheme generalizes trivially to the crossing dependency language.
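The scheme just described can be sketched as a next-symbol oracle, one set of valid next symbols per input prefix (our own rendering of the Gers and Schmidhuber (2001) scheme, assuming n, m ≥ 1, with ⊣ standing for [EOS]):

```python
def targets_nesting(n, m):
    """Valid-next-symbol sets for each prefix of a^n b^m c^m d^n."""
    tgt = []
    for _ in range(n):                 # reading a's: n undetermined
        tgt.append({"a", "b"})
    for _ in range(m):                 # reading b's: m undetermined
        tgt.append({"b", "c"})
    for j in range(m):                 # reading c's: all m must be matched
        tgt.append({"c"} if j < m - 1 else {"d"})
    for i in range(n):                 # reading d's: all n must be matched
        tgt.append({"d"} if i < n - 1 else {"⊣"})
    return tgt

def targets_crossing(n, m):
    """Same scheme generalized to the crossing a^n b^m c^n d^m."""
    tgt = []
    for _ in range(n):
        tgt.append({"a", "b"})
    for _ in range(m):
        tgt.append({"b", "c"})
    for i in range(n):                 # c-block now matches the a-count
        tgt.append({"c"} if i < n - 1 else {"d"})
    for j in range(m):                 # d-block matches the b-count
        tgt.append({"d"} if j < m - 1 else {"⊣"})
    return tgt
```

The only difference between the two languages, from the model's perspective, is which earlier count governs the c- and d-blocks.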

Multiple Agreements
Being simple counter languages, the multiple agreements family has previously been studied extensively. We complement the related work by adding an MG-recognizable language, as well as by giving additional analyses of the learned patterns.
Stabler (1997) indicates that a^n b^n c^n d^n is in L(MG). Moreover, σ_1^n ... σ_k^n for arbitrary k is MG-recognizable (Jäger and Rogers, 2012). Thus, we study a^n b^n c^n d^n e^n as the simplest strictly MG-recognizable language in this family.
Since very few examples are available, we use the next character prediction task, which is also an established setup for these languages. Gers and Schmidhuber (2001) and Suzgun et al. (2019) have proposed input-target schemes for a^n b^n, a^n b^n c^n, and a^n b^n c^n d^n, and the scheme generalizes trivially to the MG-recognizable a^n b^n c^n d^n e^n: the next valid character is a or b as long as the input is a, but once b occurs in the input, n is determined and the next characters are deterministic from this point on.

Table 1 summarizes all languages studied in this work, and we also include detailed dataset statistics in Appendix B.
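The generalized input-target scheme described above can be sketched for an arbitrary alphabet as follows (our own rendering; the block order follows the alphabet, and ⊣ stands for [EOS]):

```python
def targets_agreement(n, alphabet="abcde"):
    """Valid-next-symbol sets for sigma_1^n ... sigma_k^n
    (with the default alphabet, a^n b^n c^n d^n e^n)."""
    k = len(alphabet)
    # First block: n is undetermined, so the first or second symbol may follow.
    tgt = [{alphabet[0], alphabet[1]} for _ in range(n)]
    # Later blocks: n is fixed, so continuations are deterministic.
    for b in range(1, k):
        for i in range(n):
            if i < n - 1:
                tgt.append({alphabet[b]})       # more of the same symbol
            elif b < k - 1:
                tgt.append({alphabet[b + 1]})   # move to the next block
            else:
                tgt.append({"⊣"})               # string complete
    return tgt
```

Calling `targets_agreement(n, "ab")` or `targets_agreement(n, "abc")` recovers the established a^n b^n and a^n b^n c^n schemes as special cases.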

Experiments
Each model is evaluated on three sets: an in-distribution held-out test set, an out-of-distribution (henceforth OOD) set with strings longer than the ones seen during training, and a second OOD set with even longer strings. We report the mean and standard deviation of test accuracies over three runs with different random seeds.
We then visualize and analyze the clearest head in our best-performing runs, but we note that all visualized patterns do recur across different configurations.

Copy Languages
Our Transformer models learned ww and the related ww^R and www with high accuracy and outperformed LSTMs in in-distribution tests. However, in the two OOD tests, only LSTMs were able to extrapolate, while the accuracies of the Transformers were close to random guessing (Table 2).
We identified certain heads that align the substrings along different diagonals to measure the similarity of a string to itself (Figure 3). For ww, the gold alignment aligns the first w against the second and vice versa. In the visualized run, we find that among all positive examples in the test set, 93.4% of the time the highest query-key attention is on the gold alignment. For ww^R, the gold alignment expects the head of one substring to attend to the tail of the other, resulting in an anti-diagonal pattern; 94.8% of the time the highest attention is on the gold alignment diagonals among all positive examples in the test set.
For www, a gold alignment requires each substring to attend to the other two, resulting in a total of six alignments, which we did obtain during training, as shown in Figure 4. The six alignments are distributed across heads in a multi-head model, where only three of the six are well aligned, while the other, partial alignments appear to be auxiliary, if they were useful for inference at all. In the visualized model, the three clearer alignments in the first head on average match the gold alignment 86.1% of the time over positive examples in the test set, while the accuracy is 39.0% for the other three partial alignments.
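The alignment statistics above can be computed from an attention matrix roughly as in the sketch below (a simplified version for ww; the handling of ties and of any special tokens is our own assumption):

```python
def gold_key_ww(i, length):
    """For ww of length 2h, the gold alignment pairs position i with the
    same position in the other copy: (i + h) mod 2h."""
    h = length // 2
    return (i + h) % length

def alignment_accuracy(attn):
    """Share of query rows whose highest attention weight falls on the
    gold-aligned key. attn is a square list-of-lists attention matrix
    for a single even-length string."""
    n = len(attn)
    hits = 0
    for i, row in enumerate(attn):
        if max(range(n), key=lambda j: row[j]) == gold_key_ww(i, n):
            hits += 1
    return hits / n
```

For ww^R, the same metric applies with the gold key changed to the anti-diagonal pairing, and for www each query has two gold keys, one per other copy.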

Crossing
The in-distribution tests were solved almost perfectly by all three model setups, including a setup in which we remove the sinusoidal PE and rely only on the indirect positional information from the look-ahead mask. For the OOD tests, we evaluate models on strings where both n and m are OOD, as well as strings in which only one of n and m is OOD. We find that removing PE helped the Transformer's extrapolation, which is consistent with the finding of Bhattamishra et al. (2020a) on other languages. However, this extrapolation ability is still not as good as that of LSTMs (Table 3).
Figure 5 shows that the Transformer's attention formed a checkerboard pattern for recognizing crossing, as the result of each symbol in the query attending to every occurrence of itself and its dependent in the key but not to the non-dependents. As for nesting, the pattern is very similar except that the dependents are different. Models trained with or without sinusoidal PE end up learning very similar patterns, except that without PE, the attention from one query symbol to every occurrence of a key symbol is uniformly distributed, resulting in a stack of color bands on the attention map of the visualized head.

Table 3: Nesting/Crossing: Not using sinusoidal PE helped with extrapolation. Note that the three tests do not share datapoints despite the seemingly partially overlapping length ranges.

Figure 5: Crossing shows a checkerboard pattern as the result of correctly identifying pairwise dependents, while nesting has a similar pattern except for different symbol dependencies.
The attention maps suggest that in the optimal case, each query symbol identifies the key symbol on which it is pairwise dependent, and then, within the portion visible under the look-ahead mask, distributes its attention over every occurrence of itself and its dependent while giving zero attention to the other pair. As an example, we measure how accurately the visualized head in the crossing model without PE implements this optimum: across all in-distribution test set datapoints, keys that should receive 100% of the attention weights from each query symbol received on average 93.0% of them.

Multiple Agreements
We follow the established finding in Bhattamishra et al. (2020a) and only consider the Transformer without sinusoidal PE for these languages, as training with PE was ineffective in pilot experiments. The Transformers without PE demonstrated the ability to extrapolate, although on average they are still not as good or as consistent as LSTMs (Table 4). In Figure 6, we annotate the mean and standard deviation, in percentages, of all attention values from one query symbol to one key symbol. Every input symbol attends to different symbols with different weights, but the attention values to each occurrence of the same symbol are similar and thus have low variance, culminating in the grid pattern we see on the maps. The differences in the weights from a query symbol suggest a particular dependency analysis learned by that run.
Unlike what we have discussed for copying and crossing, the multiple agreements strings do not have a definite dependency relation to learn, and many possible analyses exist for the same string: e.g., any two symbols in the alphabet could be pairwise dependent, or all symbols in the alphabet could be dependent on each other (cf. Joshi (1985)). Thus, from run to run, which key symbol receives the most attention from a query symbol, and how much attention each symbol receives, can indeed vary. The lack of a gold analysis also means the model cannot simply focus on a subset of the symbols; it is crucial for every symbol to attend to every symbol, as we see here.

Scramble Languages
We use the macro F-1 score as the metric for this set of languages, since our data generation is skewed towards negative examples. Despite the seeming complexity of the data, Transformers solve the MIX and O_2 in-distribution test sets perfectly, and LSTMs also attain very high scores. However, the MIX OOD sets are challenging for both models, while LSTMs outperformed Transformers on the OOD sets for O_2 (Table 5).
Since a^n b^n c^n ⊂ MIX and a^n b^m c^n d^m ⊂ O_2, we use unscrambled strings in the visualizations in Figure 7 for readability. MIX has a pattern resembling the one in multiple agreements, in that the attention from one query symbol to one key symbol is similar across all occurrences and has low variance, as annotated in the visualized example. Similarly, O_2 has a pattern resembling the checkerboard in crossing, although here the queries do not ignore non-dependents. It is still evident that each query in O_2 identified which two symbols should form pairwise dependents and used similar attention weights for the pair. Do note that although we visualize unscrambled strings for readability, the similar-attention and low-variance properties hold for scrambled strings as well.
As an additional analysis, we probe the MIX representations to see what information is encoded. One candidate is the count of each symbol's occurrences at each timestep, which directly follows from MIX's definition. We decode the MIX embeddings for two possible counting targets: a full counting target that maintains the ongoing tallies of all three symbols, and a 2-counter-based target that maintains the values of the memory cells.
Similar to the methodology in Wallace et al. (2019), we use an MLP regressor prober with one hidden layer, ReLU activations, and MSE loss. We train the prober for up to 300 epochs with early stopping. On the in-distribution test set, the prober with the full counting target achieves an MSE of 0.21 and a Pearson correlation of 0.929 between the target and predicted count values, in contrast to the 2-counter target's MSE of 0.61. This suggests that the learned representations contain count information, which may have been useful for solving the scramble languages.
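For reference, the full counting target can be constructed as below (our own sketch; the 2-counter target is omitted here because its exact memory-cell semantics are not spelled out in this section):

```python
def full_count_targets(s, alphabet="abc"):
    """Per-timestep probing targets for MIX: after reading each prefix,
    the running tally of every symbol in the alphabet."""
    counts = {c: 0 for c in alphabet}
    out = []
    for ch in s:
        counts[ch] += 1
        out.append([counts[c] for c in alphabet])
    return out
```

A probe trained to regress these vectors from the timestep representations tests whether the encoder tracks, for every prefix, how many of each symbol it has seen, which is exactly the information a counter-based recognizer of MIX would need.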

Discussion and Conclusion
We empirically studied the Transformer's ability to learn a variety of linguistically significant MCSLs. The significance of these languages is two-fold: they represent a hypothesized upper bound on the complexity of natural language, and they are abstractions of the motivating linguistic phenomena. Overall, the Transformers performed well in in-distribution tests and are comparable to LSTMs, but their ability to extrapolate is limited. In our next character prediction experiments, removing the sinusoidal PE alleviated the problem, which is an established empirical finding for some formal languages and for natural language, but this technique does not always generalize to other data, nor does it work in the alternative task setup.
Transformers leveraged the attention mechanism to score the similarity between substrings. In our analyses, the learned self-attention alignments often reflect the symbol dependency relations within the string, which is useful for MCSLs because of the rich and complex dependencies in these languages. In a more complex language like MIX, Transformers implicitly learned some form of counting behavior that may have helped solve the language.
Within the same family of languages spanning complexity classes, the learned patterns are similar, and no significant differences in behavior are observed between the reduced- and added-complexity languages. This may suggest that we cannot directly draw parallels between the MCSG formalisms and the Transformer's expressiveness, the way other formal models such as circuits (Hao et al., 2022; Merrill et al., 2022) allow. However, this work serves as an example of how we may draw inspiration from the rich MCSL scholarship to motivate work in current NLP, as these languages help us examine the linguistic capacity of current and future NLP models.

Limitations
An empirical study of formal language learning is always inconveniently insufficient: there is some upper bound on string length that any experiment can reach or reasonably work with, so any conclusions drawn are based on an unintentionally bounded dataset. This could weaken arguments about learnability in general, as the bounded dataset might form a language of reduced complexity.
In addition, the roles of the other heads, the feedforward sublayer, etc., are not investigated. Therefore, despite the meaningful and interpretable patterns learned, we cannot definitively say how self-attention directly contributed to inference (cf. Wen et al. (2023)).
Ideally, we would complement the empirical findings with theoretical constructions of whether and how the MCSLs can be learned, which is lacking in the current work. However, the empirical results serve as a foundation towards that goal. In particular, the highly interpretable self-attention patterns could inspire us and hint at what such theoretical constructions would look like.

A Additional Experiment Details
Choice of Task In pilot experiments, training crossing and multiple agreements with the binary classification task was unsuccessful, and our analyses suggest that the models learned spurious statistical cues. The task was especially difficult for multiple agreements, where only one example is available for each n. For crossing, we tried using an equal number of positive and negative examples, but that is not enough for the model to rule out alternative wrong hypotheses. On the other hand, if we follow what we did for O_2 and enumerate all possible negative examples in a length range, then, since crossing has a much wider length range than O_2, we get an explosion of negative examples that is impractical to work with. Teaching these two families of languages with the binary classification setup may still be possible, but the negative examples would likely need to be carefully curated, so that we avoid an explosion of negative examples over positive examples while still having enough negative datapoints to help the model eliminate most wrong hypotheses, such as those based on spurious cues.

The [EOS] Decision
In the binary classification task, since we are not scoring or generating a string, the decision on whether to add [EOS] to strings is arbitrary. Newman et al. (2020) suggest that without [EOS], models may extrapolate better to longer strings. We tried the setup without [EOS] in pilot experiments but found no significantly better performance for our Transformer models on the studied languages. We chose to include [EOS], and we use the [EOS] embeddings as the sentence representations for LSTMs in this task.

B Training Details
The datasets for all languages are generated exactly once and are used across hyperparameter tuning and the final experiments. We record the random seeds used for generation for reproducibility. We try to enumerate all examples in our chosen length range, except for the OOD sets of the scramble languages, where we downsample because of an explosion of datapoints as strings get longer. For MIX, positive examples are capped at

Figure 2 :
Figure 2: Swiss German subordinate clauses allow n accusative NPs before m dative NPs, followed by n corresponding accusative-object-taking verbs before m corresponding dative-object-taking verbs. Shieber (1985) defined a homomorphism for Swiss German such that intersecting its image with the regular language wa*b*xc*d*yz yields the non-context-free wa^n b^m xc^n d^m yz, which contradicts CFL's closure under intersection with regular languages.
Joshi (1985) argued that MCSG should only handle limited cross-serial dependency like the type found in Dutch (Bresnan et al., 1982) and Swiss German, but not as in MIX = {w ∈ {a, b, c}* | |w|_a = |w|_b = |w|_c}, where |w|_σ denotes the number of occurrences of symbol σ in string w; that is, the language of strings with an equal number of a's, b's, and c's occurring in any order, which can thus be seen as scrambled a^n b^n c^n. MIX resembles an extreme case of free word order and is not recognizable by TAG (Kanazawa and Salvati, 2012). However, it turned out that the language is in L(MCFG) (Salvati, 2015).
A related language, O_2 = {w ∈ {a, b, c, d}* | |w|_a = |w|_c ∧ |w|_b = |w|_d}, which can be seen as scrambled a^n b^m c^n d^m, is also in L(MCFG) (Salvati, 2015). We investigate the two scramble languages using the binary classification task, as the model may benefit from seeing the whole string at once to directly model the occurrences of each symbol. For MIX, the positive examples exhaustively enumerate all permutations of a^n b^n c^n in the chosen n range, and the negative examples try to enumerate {w ∈ {a, b, c}* | |w|_a ≠ |w|_b ∨ |w|_a ≠ |w|_c ∨ |w|_b ≠ |w|_c} within the same range to help the model better eliminate most wrong hypotheses. As for O_2, we enumerate permutations of a^n b^m c^n d^m in the chosen n, m range, and the negative examples are the remaining strings in {a, b, c, d}* within the same range. The train-test split is performed over each sequence length separately rather than over the entire dataset, so strings of different lengths appear in all splits.
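A sketch of this generation for MIX might look as follows (our own code; the paper additionally caps and downsamples the sets, which we omit here):

```python
import itertools

def is_mix(s):
    """MIX membership: equal counts of a, b, and c."""
    return s.count("a") == s.count("b") == s.count("c")

def mix_positives(n_range):
    """All distinct permutations of a^n b^n c^n for n in n_range."""
    out = set()
    for n in n_range:
        base = "a" * n + "b" * n + "c" * n
        out.update("".join(p) for p in itertools.permutations(base))
    return out

def mix_negatives(max_len):
    """All strings over {a, b, c} up to max_len with unequal counts."""
    out = []
    for k in range(1, max_len + 1):
        for t in itertools.product("abc", repeat=k):
            s = "".join(t)
            if not is_mix(s):
                out.append(s)
    return out
```

Note that `itertools.permutations` treats repeated symbols as distinct, so the set comprehension is what deduplicates the 720 raw permutations of, e.g., "aabbcc" down to the 90 distinct strings; enumeration for O_2 proceeds analogously over {a, b, c, d}.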

Figure 3 :
Figure 3: Anti-diagonal alignment for ww^R, and forward and backward alignment for ww.

Figure 4 :
Figure 4: www expects six alignments in total, usually distributed across heads, where half are better aligned, and the other half partially aligned.

Figure 6 :
Figure 6: Every occurrence of a query symbol attends to every occurrence of a key symbol using similar attention values, thus low variance among the values.The query symbols also attend to different key symbols with different weights, thus resulting in a grid pattern.

Figure 7 :
Figure 7: MIX resembles multiple agreements in that the attention weights from one query symbol to one key symbol are similar.O 2 resembles the checkerboard in crossing although it does not ignore non-dependents.
Table 5: Scramble macro F-1 (%): the models performed perfectly on the in-distribution tests, but the MIX OOD sets are challenging for both models, whereas LSTMs outperformed Transformers on the O_2 OOD sets. Note that the three tests for O_2 do not share datapoints.

Table 1 :
Languages this work studies, organized by complexities and basic MCSL constructions each represents or resembles.

Table 2 :
Palindrome/Copy/2-Copy: Transformers surpass LSTMs for in-distribution tests but fall to random guesses for OOD (OOD null accuracy = 50%).