Syntactic Inductive Bias in Transformer Language Models: Especially Helpful for Low-Resource Languages?

A line of work on Transformer-based language models such as BERT has attempted to use syntactic inductive bias to enhance the pretraining process, on the theory that building syntactic structure into the training process should reduce the amount of data needed for training. But such methods are typically tested only on high-resource languages such as English. In this work, we investigate whether these methods can compensate for data sparseness in low-resource languages, hypothesizing that they ought to be more effective for low-resource languages. We experiment with five low-resource languages: Uyghur, Wolof, Maltese, Coptic, and Ancient Greek. We find that these syntactic inductive bias methods produce uneven results in low-resource settings, and provide surprisingly little benefit in most cases.


Introduction
Many NLP algorithms rely on high-quality pretrained word representations for good performance. Pretrained Transformer language models (TLMs) such as BERT/mBERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLM-R (Conneau et al., 2020), and ELECTRA (Clark et al., 2020) provide state-of-the-art word representations for many languages. However, these models require on the order of tens of millions of tokens of training data to achieve even minimal quality (Micheli et al., 2020; Warstadt et al., 2020), a data requirement that most of the world's languages cannot practically satisfy.
There are at least two basic approaches to addressing this issue. The first, which is at least as old as BERT, exploits multilingual transfer to reduce the data requirements for any individual language. The second aims to reduce TLMs' data requirements by modifying their architectures and algorithms. For example, Gessler and Zeldes (2022) more effectively train low-resource monolingual TLMs with as few as 500K tokens by reducing model size and adding supervised pretraining tasks with part-of-speech tags and syntactic parses.
We take up the latter direction in this work, looking specifically at whether the addition of syntactic inductive bias (SIB) during the pretraining procedure may help improve TLM quality in low-resource, monolingual settings. Specifically, we examine two methods which have been proposed for high-resource settings: the two syntactic contrastive loss functions of Zhang et al. (2022b), and the modified self-attention algorithm of Li et al. (2021), in which a restricted self-attention mechanism, allowing tokens to attend only to syntactically "local" tokens, complements the standard self-attention mechanism.
At a high level, SIB is of interest in the context of TLMs because of how crucial self-attention is for TLMs' syntactic knowledge. In studies on an English TLM, BERT, Htut et al. (2019) and Clark et al. (2019) show that while syntactic relations are not directly recoverable from self-attention patterns, many self-attention heads seem to be sensitive to particular syntactic relations, such as that of a direct object or a subject. But self-attention is completely unbounded: during pretraining, the model has to learn from scratch how to decide which other tokens in an input sequence a token should attend to. We therefore observe that if SIB could be effectively applied, then presumably self-attention weights would converge more quickly and learn more effectively, since their behavior has been observed to be so heavily syntactic in nature.
Moreover, we expect that this effect would be greater for low-resource languages, where the comparative lack of data is known to hamper models' ability to form robust linguistic representations. We find additional motivation for our interest in SIB given the nearly universal view held by linguists that the human mind does not start with the equivalent of a totally unconstrained self-attention mechanism: for example, psycholinguists such as Hawkins (2014) have extensively documented processing-related constraints on syntax, and Generative linguists such as Ross (1967) have observed that many syntactic constructions which might have been possible are in fact not attested in English or any other language, and postulate that these constructions are at least in some cases "impossible" because of biologically-determined properties of the human mind. Our goal is therefore to give our models something like the constraints the human mind has in order to help them learn more effectively with less data.
We use a standard BERT-like TLM architecture as our base model, though we heavily reduce model size, following the results of Gessler and Zeldes (2022), which showed that this is beneficial in low-resource monolingual settings. We pretrain TLMs for five low-resource languages-Wolof, Coptic, Maltese, Uyghur, and Ancient Greek-varying which SIB methods are used. We then use Universal Dependencies (UD) (Nivre et al., 2016) syntactic parsing and WikiAnn (Pan et al., 2017) named entity recognition as representative downstream tasks that allow us to assess the quality of our models. Additionally, we evaluate our models using PrOnto (Gessler, 2023), a suite of downstream task datasets for low-resource languages. We find that these SIB methods are not very effective in low-resource languages, with small gains in some tasks and degradations or no effects in others. This is surprising given the intuition that SIB ought to help more in low-resource settings, and we speculate that other methods for SIB may be more effective in low-resource settings.
We summarize our contributions as follows:
1. We conduct what is, to the best of our knowledge, the first work examining whether SIB is helpful for pretraining low-resource Transformer LMs.
2. We reimplement SynCLM (Zhang et al., 2022b), SLA (Li et al., 2021), and MicroBERT (Gessler and Zeldes, 2022) in plain PyTorch and make them openly accessible.
3. We present evidence from seven downstream evaluation tasks showing that the two SIB methods we examine are largely ineffective in our experimental settings, yielding only scattered and small gains.
Related Work

Throughout the development of pretrained language models, high-resource languages have received the majority of attention, and although interest in low-resource settings has increased in the past few years, there remains a large gap (in terms of linguistic resources, pretrained models, etc.) between low- and high-resource languages (Joshi et al., 2020).

Multilingual Models
The first modern multilingual TLM was mBERT, trained on 104 languages (Devlin et al., 2019). mBERT and other models that followed it, such as XLM-R (Conneau et al., 2020), demonstrated that multilingual pretrained TLMs are capable of good performance not just on languages represented in their training data, but also in some zero-shot settings (cf. Pires et al. 2019; Rogers et al. 2020, among others). But this is not without a cost: it has been shown (Conneau et al., 2020) that when a TLM is trained on multiple languages, the languages compete for parameter capacity in the TLM, which effectively places a limit on how many languages can be included in a multilingual model before performance significantly degrades for some or all of the model's languages. Indeed, the languages which had proportionally less training data in XLM-R's training set tended to perform more poorly (Wu and Dredze, 2020).
A possible solution to this difficulty is to adapt pretrained TLMs to a given target language, rather than trying to fit the target language into an ever-growing list of languages that the model is pretrained on. One popular method for doing this involves expanding the TLM's vocabulary with additional subword tokens (e.g. BPE tokens for RoBERTa-style models), which has been observed to improve tokenization and reduce out-of-vocabulary rates (Wang et al., 2020; Artetxe et al., 2020; Chau et al., 2020; Ebrahimi and Kann, 2021), leading to downstream improvements in model performance. But these and other approaches struggle when a language is very far from any other language that a multilingual TLM was pretrained on.
Multilingual models like XLM-R which are trained on over 100 languages could be described as massively multilingual models. A more recent trend is to train multilingual models on just a few to a couple dozen languages, especially in low-resource settings. For example, Ogueji et al. (2021) train an mBERT-style model on data drawn from 11 African languages, totaling only 100M tokens (cf. BERT's 3.3B), and find that their model outperforms massively multilingual models such as XLM-R, presumably because the African languages in question were quite unrelated to most of the languages XLM-R was trained on.

Monolingual Models
There has been comparatively little work exploring pretraining monolingual low-resource TLMs from scratch, and this lack of interest is likely explained by the fact that monolingual TLMs require copious training data in order to be effective. Several studies have examined the threshold under which monolingual models significantly degrade, and all find that using standard methods, more data than is available in "low-resource" settings (definitionally, if we take "low-resource" to mean 'no more than 10M tokens') is required in order to effectively train a monolingual TLM. Martin et al. (2020) find that at least 4GB of text is needed for near-SOTA performance in French, and Micheli et al. (2020) show further for French that at least 100MB of text is needed for "well-performing" models on some tasks. Warstadt et al. (2020) train English RoBERTa models on datasets ranging from 1M to 1B tokens and find that while models acquire linguistic features readily on small datasets, they require more data to fully exploit these features in generalization on unseen data. Gessler and Zeldes (2022) is the only work we are aware of which attempts to develop a method for training "low-resource" (<10M tokens of training data) monolingual TLMs. They extend the typical MLM pretraining process with multitask learning on part-of-speech tagging and UD syntactic parsing, and also radically reduce model size to 1% of BERT-base, yielding fair performance gains on two syntactic evaluation tasks. They find that their monolingual approach generally outperforms multilingual methods for languages that are not represented in the training set of a multilingual TLM (mBERT, in their study).

Syntactic Inductive Bias
Other work has investigated the syntactic capabilities of TLMs, and whether these capabilities could be enhanced with additional inductive bias. In an influential study, Hewitt and Manning (2019) find that structures resembling undirected syntactic dependency graphs are recoverable from TLM hidden representations using a simple "structural probe", consisting of a learned linear transformation and a minimum spanning tree algorithm for determining tokens' syntactic dependents based on L2 distance. Kim et al. (2020) find similar results with a non-parametric, distance-based approach using both hidden representations and attention distributions. Both of these works attempt to find syntactic representations within a TLM without ever exposing the TLM to a human-devised representation. The quality of the recovered trees is usually poor relative to those obtainable from a syntactic parser, though it is consistently higher than random baselines. Some works have attempted to provide models with direct access to human-devised representations-e.g., a syntactic parse provided in the Universal Dependencies formalism, which may have been produced by a human or by an automatic parser. Zhou et al. (2020) extend BERT by adding dependency and constituency parsing as additional supervised tasks during pretraining. Bai et al. (2021) assume that inputs are paired with parses, and use the parses to generate masks which restrict an ensemble of self-attention modules to attend only to syntactic children, parents, or siblings. Xu et al. (2021) use dependency parses to bias self-attention so that attention between tokens is weighted proportionally to the tokens' distance in the parse. In this paper, we examine the methods of Li et al. (2021) and Zhang et al. (2022b), which we describe below.
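To make the structural-probe idea concrete, the following is a minimal sketch in plain Python with toy vectors; `B` stands in for the probe's learned linear transformation (trained on gold parses in the real method), and `probe_distance` is a hypothetical helper name, not the authors' code:

```python
# Sketch of a Hewitt-and-Manning-style structural probe distance.

def matvec(B, h):
    """Multiply matrix B (a list of rows) by vector h."""
    return [sum(b * x for b, x in zip(row, h)) for row in B]

def probe_distance(B, h_i, h_j):
    """Squared L2 distance between projected hidden states:
    d(h_i, h_j) = ||B(h_i - h_j)||^2."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    v = matvec(B, diff)
    return sum(x * x for x in v)

# Toy example: identity projection, 2-dimensional hidden states.
B = [[1.0, 0.0], [0.0, 1.0]]
print(probe_distance(B, [1.0, 0.0], [0.0, 2.0]))  # 1 + 4 = 5.0
```

In the actual probe, these pairwise distances are then fed to a minimum spanning tree algorithm to recover an undirected tree over the tokens.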
In sum, there are many ways in which one could encourage a TLM either to learn a human representation of syntax or to come up with (or reveal) its own. To our knowledge, none of the works on SIB have been examined in a low-resource TLM pretraining setting.

Approach
This work investigates whether methods for SIB that have succeeded in high-resource monolingual TLM pretraining settings could also be useful in analogous low-resource settings. As we have seen, monolingual TLMs tend to have very poor quality when less than ≈10M tokens of training data are available for pretraining, and moreover, it has been observed that at least one dimension of this poor quality is models' inability to make grammatical generalizations without a large (≈1B tokens, Warstadt et al. 2020) pretraining dataset. Since it is (almost definitionally) difficult to get more data in low-resource settings, it is especially important to find other ways of improving model quality. It is therefore worthwhile to examine whether supplying some kind of SIB could help a low-resource TLM form better linguistic representations.
As discussed in §2.3, there are many ways to introduce SIB into a TLM. In this work, we look specifically at two methods: SynCLM (Zhang et al., 2022b) and SLA (Li et al., 2021), the latter of which is also used by Zhang et al. (2022b). Li et al. (2021) extend the self-attention module with "local attention", wherein tokens may only attend to tokens which are ≤ k edges away in the dependency parse tree. Zhang et al. (2022b) devise two contrastive loss functions intended to encourage tokens to attend to sibling and child tokens, and in their experiments they find success in combining these with SLA. A concise description of the details of each method is available in Appendix A. Both of these methods have only been evaluated on English, and both assume a UD syntactic parse as an additional input for each input sequence, using the parse in different ways to attempt to guide the model to better syntactic representations.
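As an illustration of the kind of locality constraint involved, the following hypothetical helper computes pairwise syntactic distances from UD-style head indices and derives a boolean mask of token pairs within k tree edges; this is a sketch of the idea, not the authors' implementation:

```python
from collections import deque

def syntactic_distances(heads):
    """All-pairs path lengths in an unlabeled dependency tree.
    heads[i] is the 0-based index of token i's head, or -1 for the root."""
    n = len(heads)
    # Build an undirected adjacency list from the head pointers.
    adj = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i].append(h)
            adj[h].append(i)
    dist = [[0] * n for _ in range(n)]
    for s in range(n):
        seen = {s}
        q = deque([(s, 0)])
        while q:  # breadth-first search from token s
            u, d = q.popleft()
            dist[s][u] = d
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    q.append((v, d + 1))
    return dist

def local_mask(heads, k):
    """True where attention is allowed: token pairs within k tree edges."""
    dist = syntactic_distances(heads)
    return [[d <= k for d in row] for row in dist]

# "the cat sat": the -> cat -> sat, with sat as root.
print(local_mask([1, 2, -1], k=1))
```

With k=1, "the" may attend to itself and "cat" but not to "sat", which is two edges away in the tree.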
We use these two SIB methods with the model of Gessler and Zeldes (2022), MicroBERT, as a foundation. MicroBERT is a BERT-like model that has been scaled down to 1% of BERT-base, and that optionally employs part-of-speech tagging and syntactic parsing as auxiliary pretraining tasks. As shown by experiments on 7 low-resource languages conducted by Gessler and Zeldes (2022), MicroBERT performs much better than an unmodified BERT-base TLM, so we adopt it as our baseline model for most experiments in this work.
We now state our two main research questions:
• (RQ1) Do SynCLM and SLA yield benefits for low-resource monolingual TLM pretraining?
• (RQ2) Are any such benefits complementary with the PoS tagging pretraining strategy introduced in Gessler and Zeldes (2022)?

Data and Evaluation
We reuse the datasets and evaluation setup of Gessler and Zeldes (2022), using five of their seven "truly" low-resource languages' datasets. Each language's data includes a large collection of unlabeled pretraining data sourced from Wikipedia, as well as two datasets for downstream tasks for evaluation: UD treebanks for syntactic parsing, and WikiAnn (Pan et al., 2017) for named entity recognition (NER). We refer readers to Gessler and Zeldes' paper for further details on these datasets and the models for UD parsing and NER. In addition, we assess models on all five tasks in the PrOnto benchmark (Gessler, 2023), which will be described below.

Models
We reimplement the MicroBERT model of Gessler and Zeldes (2022), as well as the methods of Zhang et al. (2022b) and Li et al. (2021). In all cases, we reuse code wherever possible and closely check implementation details and behavior in order to ensure correctness. As a foundation, we use the BERT implementation provided in HuggingFace's transformers package (Wolf et al., 2020), and we also use AI2 Tango for running experiments. We obtain all of our parses for the unlabeled portions of our datasets automatically using Stanza (Qi et al., 2020), following Zhang et al. (2022b).
In order to answer our research questions, for each language, we examine the following conditions:
1. MBERT - plain multilingual BERT (bert-base-multilingual-cased). A baseline; numbers taken from Gessler and Zeldes.
2. MBERT-VA - MBERT, but with vocabulary augmentation. A baseline; numbers taken from Gessler and Zeldes.
3. µB-M - plain MicroBERT trained only using MLM. We obtain our own numbers to verify the correctness of our implementation.
4. µB-MP, µB-MT, µB-MPT - MicroBERT with either one or both of the SynCLM loss functions: P indicates the phrase-guided loss, and T indicates the tree-guided loss.
5. µB-MPT-SLA - µB-MPT, with the addition of SLA. We follow Zhang et al. (2022b) in using SLA only in conjunction with both contrastive losses.
6. µB-MX, µB-MXP, µB-MXT, µB-MXPT, µB-MXPT-SLA - the conditions in (3-5), but with the addition of part-of-speech tagging (X) as an auxiliary pretraining task. This is done using the same methods as Gessler and Zeldes: PoS tagging is performed only on gold-tagged data from the UD treebank, and tagged sequences are mixed into the pretraining data at a 1:8 ratio.
Revisiting our research questions, we intend for the conditions in (3-5) to provide evidence for (RQ1), and for the additional information from the conditions in (6) to provide evidence for (RQ2).

Results
Parsing Our results for UD syntactic parsing are given in Table 2. While all models beat the multilingual baselines, neither SynCLM nor SLA seems to improve model quality. In the -M variant models, the top-performing model is always the one trained with plain masked language modeling. This is not so for the -MX variant models, where the -MXP and -MXPT models do slightly better on average, though this difference is small enough to be within the range of experimental noise. Surprisingly, -MPT-SLA models do worst of all. Finally, comparing -M variants to their -MX counterparts, we do find that in all cases the -MX counterpart is better on average, and that the difference is about 1% LAS.
NER Our results for WikiAnn NER are given in Table 3. Considering the -M variant models first, we see that in all cases the model trained using only MLM performs the worst, and the -MPT-SLA variant, while never better than the -MP, -MT, and -MPT variants, still outperforms the plain MLM model. The -MP, -MT, and -MPT variants do best, with a difference of up to 4 points F1 on average.
Turning now to the -MX variants, while it is still true that on average the plain MLM model performs worst and the non-SLA SynCLM models perform best, there is more variation within individual languages.The best model for Uyghur is the plain MLM model, and for Maltese, the plain MLM model outperforms µB-MXT and µB-MXPT.
Considering now all the NER results, two patterns are worth noticing. First, unlike in parsing, a -MX variant does not always outperform its -M counterpart: for example, µB-MP for Wolof is better than µB-MXP by a difference of 5 points F1. We can see further that the -M models beat the -MX models on average by about 4 points F1. This indicates that when combined with SLA and SynCLM, the PoS tagging pretraining task does not appear to be helpful for dimensions of model quality that are implicated in NER. Second, the addition of -SLA never results in a gain relative to any of the SynCLM models, except for Uyghur, where it produces a gain of 0.09, which is within the range of experimental noise.
PrOnto We run our SynCLM models on all five tasks of PrOnto (Gessler, 2023) for all languages except Maltese, which is not represented in PrOnto because of the lack of an open-access Maltese Bible. For each language in PrOnto, a dataset for five sequence classification tasks is available, constructed by aligning New Testament verses from the target language with the English verses in OntoNotes (Hovy et al., 2006) and projecting annotations from English to the target language. Each task requires a model to predict a certain grammatical or semantic property-these are, respectively: the number of referential noun phrases in a sequence; whether the subject of a sentence contains a proper noun; the sentential mood of a sentence; whether two input sequences both contain a usage of a verb sense; and whether two input sequences both contain a usage of a verb sense with the same number of arguments. We refer readers to the PrOnto publication for further details.
Results from two of the five tasks are given in Table 4. (It was not possible to run our SLA models on PrOnto due to the considerable implementation effort that would have been required, so we omit those models from this evaluation.) Broadly, we may observe that the -MPT and -MXPT models never perform best within a language, with either variant in many cases being worse by a few absolute points compared to other models. Looking at -M-family models, -MP is the clear winner, doing a little better than -M and much better than -MT or -MPT on both tasks. By contrast, for -MX-family models, the -MXP variant does a bit worse on average than -MX, and for the Same Sense task, the -MXT model does a bit better than -MXP. Looking to the rightmost column in Table 4, we can see that when we average accuracy scores for a model across all languages and all five tasks in PrOnto, the -MP model has the highest score overall, with -MX and -M very close behind and all other model variants quite a ways behind.
Overall, it seems that for the PrOnto tasks, of all the syntactic bias methods we have tried, only the use of the phrase-based contrastive loss (-MP) or the tree-based contrastive loss in combination with PoS tagging (-MXT) showed much improvement over the baselines. In individual language-task combinations, models sometimes had multiple-point performance differences over others, but when considered in aggregate, only -MP shows any improvement over -M and -MX, by 0.15% and 0.01% accuracy, respectively.

Discussion
Considering first whether SynCLM and SLA yield benefits for low-resource monolingual TLMs (RQ1), we have found positive evidence from the WikiAnn NER experiments, and weak positive evidence from the PrOnto experiments. It is true that the same methods did not produce measurable gain for the UD parsing task, but this is in line with previous findings for these two methods, where on some downstream evaluations, gain was very small or slightly negative-we return to this matter in the following paragraph. For the question of whether these benefits are complementary with the PoS tagging pretraining strategy introduced in Gessler and Zeldes (2022) (RQ2), we do not find consistent evidence in any of our experiments that both PoS tagging and SynCLM or SLA yield complementary benefits. The only positive evidence we find for this is in the PrOnto experiments, where the -MXT model variant does better than -MX in some task-language combinations, though worse overall.
The difference in the way model variants behaved in these seven evaluation tasks is striking, and it is difficult to understand why models exhibited these different behaviors. It is worth comparing these results with those reported by the SynCLM authors (Zhang et al., 2022b). For many of the GLUE tasks that they assess their models on (their Table 3), there is little or no improvement from adding -P, -T, or -PT-SLA. For example, considering their models based on RoBERTa-base, none of their model variants outperform the MLM-only baseline for the QQP (Quora Question Pairs), STS (Semantic Textual Similarity), or MNLI-m (Multi-Genre Natural Language Inference, matched) tasks. This situation is more or less analogous to the one we observed in our experiments for the UD parsing downstream task, where the addition of SynCLM and SLA had basically no effect.
On the other hand, the GLUE task with the greatest gain, CoLA (Corpus of Linguistic Acceptability), shows a difference of only 1.7% Matthews correlation coefficient, and a couple of other tasks, such as SST (Stanford Sentiment Treebank), show an improvement of only 0.3% accuracy. While it would be naïve to directly compare percentage points of different metrics in totally different experimental settings and draw conclusions about effect sizes, we nevertheless point out that we observe improvements of 1-4% F1 in our NER experiments for -M models. In light of this, we consider our results to be broadly in line with the trend for previous works' results on English: there is no improvement that is wholly consistent across evaluations, and only modest gains for the benchmarks that do improve.
In summary, we find that SynCLM and SLA produce uneven results in low-resource settings, though we also find that when they do succeed, they can yield gains that appear greater than anything observed for high-resource languages: we saw that when we take a pure MLM pretraining regimen as a base and add SynCLM and/or SLA, we are able to improve the quality of pretrained TLMs by 1 to 4 absolute points F1 in NER. While a similar benefit was not observed for UD parsing, there was a noticeable degradation on UD parsing in only a couple of cases, and in most cases the methods simply had no effect.

English Experiments
One might have expected SIB to be a knockout success for low-resource languages, given the intuition that at lower data volumes, additional bias ought to be more helpful. We considered reasons why this expectation might not have panned out: perhaps, for example, tree structure matters most for highly analytic languages like English, or perhaps the tasks used to evaluate English in GLUE are more sensitive to high-level sentence structure, or perhaps sensitivity to syntax is only advantageous given a base model with sufficiently rich distributional information. Here, we consider another possible explanation: that these methods' inductive bias only helps given high-quality syntactic parses. An obvious difference between English and the languages we have examined in this study is that UD parsers for English generally achieve much higher performance, given the size and annotation quality of English UD treebanks. This is a potentially consequential difference, given that both the SynCLM and SLA methods rely on UD parse trees as inputs. In addition, the models we have developed here differ from common kinds of English BERTs in that they are much smaller and were trained on much less data, and it is possible that the SynCLM and SLA methods interact with these two variables of model construction.
In order to investigate whether parse tree quality, model size, and pretraining data size might be consequential for these SIB methods, we run several additional experiments on English datasets. We choose English because its status as a high-resource language allows us control over several independent variables which we do not have control over in low-resource settings, namely data quantity, syntactic parse quality, and model size. We can frame an additional research question that we wish to answer:
• (RQ3) Are SynCLM and SLA sensitive to parse tree quality, model size, or pretraining dataset size?
For our English dataset, we use AMALGUM (Gessler et al., 2020) as our source of pretraining data. AMALGUM contains around 2M tokens and contains automatic parses whose quality exceeds what can normally be obtained from a standard parser. For downstream evaluation, we use the English Web Treebank (Silveira et al., 2014), which contains around 250K tokens, and the English split of WikiAnn, downsampled to around 50K tokens in order to bring it closer to the quantities for our other 3 languages (cf. Table 1). In addition, we use a 100M-token subset of BERT's pretraining data as a larger source of unlabeled pretraining data.
We frame these additional conditions for English, extending our model naming scheme from above:
1. -NP - syntax trees are taken from Stanza in the same way as before.
2. -HQP - syntax trees are taken from AMALGUM's annotations, made by a high-quality parser.
3. -BD - pretraining is done using the big dataset instead of AMALGUM.
4. -BD-BM - like -BD, and in addition, the model size is set to half of BERT-base (6 layers instead of 12).
Evidence from these conditions could tell us more about how and when SynCLM and SLA can succeed in low-resource scenarios. We pretrain these models as we did in our main experiments and evaluate them on UD parsing and WikiAnn NER.
A full description of our results is given in Appendix B, and we give a description of our key finding here: SynCLM and SLA are not very sensitive to parse quality or model size, but are sensitive to the quantity of pretraining data. The insensitivity to parse quality may come as a surprise, but we reason that it is actually understandable, since both methods focus mostly on low-height subtrees (often corresponding to phrase- or sub-phrase-level constituents), which are more likely to be correct even when overall parse quality is bad. We find evidence for sensitivity to data size in the fact that SynCLM and SLA provide gains of up to 1% F1 for the NER evaluation in the two low-data conditions, while in the higher-data conditions, all but one of the bias-enhanced models lead to degradations relative to the baseline. In sum, we take this to show that lower parse quality is not the major reason for the ineffectiveness of SynCLM and SLA in low-resource settings.

Conclusion
In this work, we have taken two methods for SIB that have succeeded in English, SynCLM and SLA, and we have investigated whether they may also be beneficial in low-resource monolingual settings. We find that in most cases these methods do not result in an improvement in model quality as measured on seven tasks. Further, in our auxiliary experiments on English, we found evidence suggesting that the lower quality of parses in low-resource settings is probably not what is driving the ineffectiveness of these SIB methods.
Considering all of our results, we conclude that these two specific methods, SynCLM and SLA, are not well suited to supporting the pretraining of language models in low-resource settings, but we also view it as a yet-open question whether any method for SIB could succeed in this role. There are some reasons why SynCLM and SLA might have been unhelpful. First of all, recall that SynCLM limits its application to only short subtrees (no taller than 3 nodes). This means that most of the time, the contrastive loss functions would only be operating on basic phrase-level constituents, such as noun phrases, and not higher, clause-level phenomena such as relations between the main clause's predicate and its arguments. If it were the case that the former kind of syntax is relatively easy for models to learn even with limited data, and that the latter kind is what is hard and therefore where SIB really ought to help, then we would expect to see exactly the results we found in this work, where neither method did much to help.
Therefore, while we find little reason to be optimistic about these two particular methods in low-resource settings, we do not view the evidence in this paper as an indictment of SIB in low-resource settings in general, and suggest that SIB methods which are better able to provide bias for higher, clause-level syntactic dependencies may produce better results for low-resource languages.

A Summary of SLA and SynCLM
Our approach critically relies on two previous results, which we summarize here.

A.1 Syntax-aware Local Attention
Li et al. (2021) introduce Syntax-aware Local Attention (SLA), a variation on the standard TLM self-attention mechanism that retains standard self-attention and complements it with a separate self-attention mechanism in which each token may only attend to "syntactically local" tokens.
Recall that BERT and most other TLMs use scaled dot-product attention in every attention head, where the attention distribution A is computed from query and key representations Q and K, d is the size of an individual attention head's hidden representation, and the attention head's output O is the product of A and the value representation V:

$$A = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right), \qquad O = AV$$

Now, assume an input sequence W = w_1, ..., w_n with an unlabeled dependency parse H = h_1, ..., h_n, where h_i indexes token w_i's syntactic head. Define syntactic distance between two words, D(w_i, w_j), as the length of the shortest path between the two words in the parse. To account for the fact that parses may be inaccurate (e.g. if they come from an automatic parser), define windowed syntactic distance as the minimum syntactic distance attained within a small window of width w around each of the two positions:

$$\tilde{D}(w_i, w_j) = \min_{|i'-i| \le w,\ |j'-j| \le w} D(w_{i'}, w_{j'})$$

This can be viewed as sacrificing precision for recall: a decision to give tokens a better chance of being able to attend to truly local tokens (given the imperfection of parser outputs), though at the cost of sometimes allowing attention on tokens that truly are not local. Now, define a mask matrix M that masks a token j for a token i iff their windowed syntactic distance exceeds a certain threshold δ:

$$M_{ij} = \begin{cases} 0 & \text{if } \tilde{D}(w_i, w_j) \le \delta \\ -\infty & \text{otherwise} \end{cases}$$

We can now define syntax-aware local attention by modifying the attention equation above so that M is added to the inner term in order to force an attention weight of 0 for masked tokens:

$$A^{\ell} = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}} + M\right)$$

Syntax-aware local attention (SLA) is used alongside the normal, "global" self-attention. To combine the two after they have been computed, introduce a gated unit for each Transformer block with new parameters W_g and b_g to compute a gate g_i for each word w_i using the word's hidden representation h_i, where σ is the sigmoid function:

$$g_i = \sigma(W_g h_i + b_g)$$

Now, use g_i to interpolate between the normal attention distribution a_i and the local attention distribution a^ℓ_i at each position i in the sequence to yield the final attention distribution Â and final attention head output Ô:

$$\hat{a}_i = g_i \, a_i + (1 - g_i) \, a^{\ell}_i, \qquad \hat{O} = \hat{A} V$$

In the original work, the SLA method is evaluated on various benchmarks on English
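The combination of global and syntax-local attention for a single query position can be sketched numerically as follows, in plain Python; `sla_row`, `allowed`, and `g` are illustrative names for this sketch, not the original implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sla_row(scores, allowed, g):
    """Combine global and syntax-local attention for one query position.
    scores: raw dot-product scores over all key positions;
    allowed: booleans from the windowed-distance threshold (delta);
    g: the sigmoid gate value for this position."""
    a_global = softmax(scores)
    # Masked tokens get -inf, so their local attention weight is 0.
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    a_local = softmax(masked)
    # Interpolate the two distributions with the gate.
    return [g * p + (1 - g) * q for p, q in zip(a_global, a_local)]

row = sla_row([1.0, 1.0, 1.0], [True, True, False], g=0.5)
print([round(x, 3) for x in row])
```

With uniform scores, the syntactically non-local third token receives a smaller final weight than the two local tokens, and the distribution still sums to 1.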
and consistently achieves measurable improvements in model quality. Parses are obtained using Stanza (Qi et al., 2020), which for English are of quite high quality (labeled attachment scores are in the mid-80s for English datasets). We refer readers to the original publication for further details, and to Figure 1 for an overview.

A.2 SynCLM

Zhang et al. (2022b) present the Syntax-guided Contrastive Language Model (SynCLM), a BERT-like TLM whose distinguishing feature is two novel contrastive loss functions; it also uses SLA (cf. Appendix A.1). Intuitively, a contrastive learning objective requires each instance to have one or more positive and negative "samples", and attempts to maximize the instance's similarity to positive samples while minimizing its similarity to negative samples (Zhang et al., 2022a). SynCLM uses a popular loss function for this, InfoNCE (van den Oord et al., 2018):

L = −log [ exp(sim(q, q⁺)/τ) / (exp(sim(q, q⁺)/τ) + Σ_j exp(sim(q, q⁻_j)/τ)) ]    (10)

Here q, q⁺, and q⁻_j are the representations of the instance, a positive sample, and a negative sample, respectively, and τ ∈ (0, 1) is a temperature hyperparameter, set to 0.1 for SynCLM. sim is a similarity function, such as cosine similarity or KL-divergence. The loss terms obtained from this equation are simply added to the loss obtained from masked language modeling. We review only the contrastive objective functions here, and refer readers to Figure 2 and the original paper for further details.
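As a concrete illustration, the InfoNCE loss of Equation 10 can be sketched in a few lines of NumPy, assuming the similarities to the positive and negative samples have already been computed (the function name and this scalar, single-instance formulation are ours; the actual implementation operates on batched tensors):

```python
import numpy as np

def info_nce(sim_pos, sim_negs, tau=0.1):
    """InfoNCE loss for a single instance.

    sim_pos: similarity between the instance and its positive sample.
    sim_negs: iterable of similarities to the negative samples.
    tau: temperature hyperparameter (0.1 in SynCLM).
    """
    logits = np.array([sim_pos, *sim_negs]) / tau
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # negative log-probability of the positive
```

The loss shrinks as the positive similarity grows relative to the negatives; per the text, this term is simply added to the masked language modeling loss.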
The two SynCLM contrastive learning objectives are distinguished by how they formulate sim.
The first, "phrase-guided" objective aims to make attention distributions more similar for words in the same phrase. Given a token t, sample a positive token t⁺ such that t and t⁺ have a lowest common ancestor t_a whose corresponding subtree (the "phrase") is no more than 2 in height. Now sample k negative tokens t⁻_1, ..., t⁻_k outside the phrase, i.e. tokens that do not have t_a as an ancestor. Define sim_phrase using the Jensen-Shannon divergence (Endres and Schindelin, 2003), a similarity metric for probability distributions:

sim_phrase(a, a′) = −JSD(a ∥ a′)    (11)

Here, a is the attention distribution for t, and a′ is the attention distribution for either a positive or a negative sample. This equation is used to calculate similarities for a given attention head and layer; in SynCLM's implementation, only the last layer is used, and sim_phrase is averaged across all attention heads in the last layer before being used with Equation 10 for the final loss computation.
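A minimal sketch of this similarity, assuming two attention distributions over the same positions (function names are ours; JSD here is the standard natural-log formulation):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # the mixture distribution
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sim_phrase(a, a_prime):
    """Similarity of two attention distributions: higher when more alike."""
    return -jsd(a, a_prime)
```

Identical distributions score 0 (the maximum), and increasingly different distributions score increasingly negative values, so maximizing this similarity for positive samples pulls their attention distributions together.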
The "tree-guided" objective proceeds similarly. A token t_i is sampled which forms the root of the positive subtree, T⁺. Next, up to three tokens t⁻_1, ..., t⁻_k are sampled such that each t⁻_i is not in T⁺ but is adjacent to a token in T⁺. A new negative subtree T⁻_i is formed for each t⁻_i by removing a random non-root token in T⁺ along with its children and putting the subtree rooted at t⁻_i in its place.
We may now define tree similarity as follows, where T is a positive or a negative subtree, T_child is the set of its non-root tokens, and z_a is the hidden representation of token a:

sim_tree = cossim(z_i, Σ_{t_j ∈ T_child} e_ij z_j), where e_ij = exp(z_i · z_j) / Σ_{t_k ∈ T_child} exp(z_i · z_k)    (12)

Informally, we take the dot product of the subtree's root with every other token in the subtree, softmax these dot products, use them to produce a weighted sum of the tokens' hidden representations, and take the cosine similarity between this weighted sum and the root of the subtree. The closer these tokens' representations are in the hidden space, the higher this similarity measure will be. Again, SynCLM uses only the last TLM layer for this objective, and this similarity measure is used with Equation 10. Note that in a preprocessing step, parses are modified so that subword tokens are syntactic children of the head token of the word they belong to.
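The weighted-sum computation just described can be sketched as follows, given the root's hidden state and those of the subtree's other tokens (function names are ours; the real implementation is batched over sampled subtrees):

```python
import numpy as np

def cossim(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim_tree(z_root, z_others):
    """Cosine similarity between the subtree root and a softmax-weighted
    sum of the hidden states of the subtree's other tokens."""
    z_others = np.stack([np.asarray(z, dtype=float) for z in z_others])
    scores = z_others @ np.asarray(z_root, dtype=float)  # dot products with the root
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # softmax over the subtree's tokens
    pooled = (weights[:, None] * z_others).sum(axis=0)   # weighted sum
    return cossim(z_root, pooled)
```

When the subtree's tokens point in the same direction as the root in the hidden space, the similarity approaches 1; dissimilar subtrees score lower, which is what the contrastive loss exploits when contrasting T⁺ against each T⁻_i.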

B English Experiments
Parsing Parsing results are given in Table 5. First note that, as before, there is little difference in model quality across the SynCLM conditions, providing more evidence that the SynCLM losses are not helpful for UD parsing. Next, as could be expected, the model trained with 100M tokens that is half the size of BERT-base performs best. What is surprising, however, is that of the remaining three models, the model with the standard parser performs best. Since all three of these variants are alike in model hyperparameters, this must be explainable in terms of properties of the three datasets.
It could be that AMALGUM's deliberate construction from eight genres in equal proportion led to serendipitously good performance on the parsing task, but it is impossible to know without further experimentation. At any rate, whatever data properties the differences in these three variants might be due to, we still have a firm answer to our most important question: for English UD parsing, the SynCLM and SLA methods appear not to be sensitive to data quantity or parse quality. The latter might be surprising, but it is worth remembering that the authors of these methods designed their algorithms in ways that may mitigate the deleterious effects of lower-quality syntactic parses. SLA uses windowed syntactic distance (cf. Equation 4 in Appendix A) for the express purpose of accommodating bad parses, and the SynCLM losses place low limits on tree height, which would help in accommodating bad parses since edges at the local, phrase level are often more reliable than edges at the clausal or inter-clausal level.
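To make this robustness argument concrete, here is a minimal NumPy sketch of the syntactic distance and windowed syntactic distance computations (function names and the window parameter b are ours; see Appendix A for the definitions):

```python
import numpy as np

def syntactic_distance(heads):
    """All-pairs path lengths in a dependency tree, treated as undirected.

    heads[i] is the (0-indexed) head of token i; the root points to itself.
    Floyd-Warshall is used for clarity, not speed.
    """
    n = len(heads)
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for i, h in enumerate(heads):
        if h != i:
            D[i, h] = D[h, i] = 1.0  # dependency edges have length 1
    for k in range(n):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

def windowed_distance(D, b=1):
    """Windowed syntactic distance: min of D[i, k] over k within b of j."""
    n = D.shape[0]
    Dt = np.empty_like(D)
    for j in range(n):
        Dt[:, j] = D[:, max(0, j - b):min(n, j + b + 1)].min(axis=1)
    return Dt
```

If an automatic parser misattaches a token, a linear neighbor of a truly local token can still fall under the distance threshold δ, so the attention mask errs towards recall rather than precision, exactly the trade-off described in Appendix A.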
NER Results on NER are given in Table 6. Surprisingly, the same half-sized BERT model that was trained on 100M tokens and did best in the parsing evaluation does very poorly on the NER task. We suspect that this may be because larger models can show greater instability in fine-tuning setups (Rogers et al., 2020). As with parsing, we see that the -NP model performs best among the MicroBERT-sized models, which we ascribe to differences in the properties of the pretraining datasets.
What is most interesting in the NER results is that for the two low-data conditions, -NP and -HQP, we see about a 1% gain in the -MPT condition relative to the MLM-only baseline. This gain is not seen in the higher-data conditions, where none of the SynCLM combinations leads to a better model except for µB-MPT-BD, with a gain of 0.45% F1. Complicating this picture, though, is that in the low-data settings, the -MP and -MT variants often underperform relative to the baseline. Still, these results seem to indicate at least that the SynCLM loss functions become less effective at improving model quality as the quantity of pretraining data increases. This holds both for the half-sized BERT model and for the MicroBERT-sized model, suggesting that model size is not the determining factor.
Discussion Returning to RQ3, these results indicate that SynCLM and SLA are not especially sensitive to parse quality, and are also not sensitive to model size, but are sensitive to the quantity of pretraining data. As discussed above, the insensitivity to parse quality is understandable, as the dimensions along which a parse may be bad are less relevant for these methods because of the way they use the parse trees. The sensitivity to pretraining data quantity is intuitive if we consider these two methods as sources of inductive bias: an inductive bias ought to push a model towards learning something that it would have learned had more training data been available, so if we consider a modification to be an inductive bias, we should expect its influence to wane as the quantity of data increases. In sum, these findings support our conclusion that SynCLM and SLA are at least in some respects well-suited to aiding the pretraining of TLMs in low-resource settings: even when parse quality is worse than ideal, SynCLM and SLA still perform about as well as when they have the highest-quality parses.

C Limitations
The goal of this paper is to make progress towards more effective TLMs for low-resource languages using syntactic inductive bias. We believe we have presented compelling evidence that two approaches to this problem seem not to be very effective for low-resource languages. But it is important to point out that we have tested the methods on only five languages. We believe that this forms an informative picture for low-resource languages in general because these languages are quite different from one another along typological and phylogenetic dimensions, but in principle, other low-resource languages could exhibit behaviors that are very different from the ones we have seen in this paper. Moreover, we have had to reimplement the methods at the center of this work, and while we have done everything we can to ascertain that these reimplementations are faithful and error-free, tensor programming is error-prone work, and it is not impossible that we introduced a bug somewhere which critically affected the experimental results.

Figure 1: Figure 1 from Li et al. (2021). The standard self-attention mechanism is complemented by another self-attention mechanism in which tokens may only attend to tokens close to them in a parse tree. A gated unit with learnable parameters interpolates the two attention distributions before the result is combined with the value representation.

Figure 2: Figure 1 from Zhang et al. (2022b). P and N_i represent the positive sample and the i-th negative sample, respectively. The phrase-based contrastive loss on the left is intended to make the representations of syntactic siblings more similar, and the tree-based contrastive loss on the right is intended to make the representations of syntactic children and parents more similar.

Table 1: Token count for each dataset by language from Gessler and Zeldes (2022), sorted in order of increasing unlabeled token count.
• (RQ1) Do these SIB methods improve model quality when applied to a low-resource language?
• (RQ2) Are there any gains complementary with the part-of-speech tagging component of MicroBERT for training low-resource monolingual TLMs?

Table 2: Labeled attachment score (LAS) by language and model combination for UD parsing evaluation. Results for MBERT and MBERT-VA are taken from Gessler and Zeldes (2022).

Table 3: F1 score by language and model combination for NER evaluation.

Table 4: Scores by language and model combination for two tasks in PrOnto: the Non-pronominal Mention Count and Same Sense tasks. For non-baseline models, an underline indicates the best performance for a language-task combination for a particular model variant (-M or -MX), and boldface indicates the best performance across either model variant. Scores for MBERT, µB-M*, and µB-MX* are taken from Gessler (2023); the asterisk indicates that the latter two models are not our implementation but the one provided in Gessler and Zeldes (2022), as reported in Gessler (2023). The rightmost column contains an average over all languages and tasks for a given model. Results for PrOnto's other three tasks are given in Appendix D.

Table 5: Labeled attachment score (LAS) for English.

Table 6: F1 score by language and model combination for NER evaluation.

Table 7: Scores by language and model combination for the Proper Noun Subject task in PrOnto. Scores for MBERT, µB-M*, and µB-MX* are taken from Gessler (2023); the asterisk indicates that the latter two models are not our implementation but the one provided in Gessler and Zeldes (2022), as reported in Gessler (2023).

Table 9: Scores by language and model combination for the Same Argument Count task in PrOnto. Scores for MBERT, µB-M*, and µB-MX* are taken from Gessler (2023); the asterisk indicates that the latter two models are not our implementation but the one provided in Gessler and Zeldes (2022), as reported in Gessler (2023).