Second Language Acquisition of Neural Language Models

With the success of neural language models (LMs), their language acquisition has gained much attention. While previous work has typically explored their first language (L1) acquisition, this work sheds light on the second language (L2) acquisition of LMs. Specifically, we trained bilingual LMs under a scenario similar to human L2 acquisition and analyzed their cross-lingual transfer from linguistic perspectives. Our exploratory experiments demonstrated that L1 pretraining accelerated their linguistic generalization in L2, and that language transfer configurations (e.g., the L1 choice and the presence of parallel texts) substantially affected their generalization. These findings clarify the (non-)human-like aspects of their L2 acquisition.


Introduction
Cross-lingual transferability of language models (LMs) has attracted much attention. For example, large English LMs show some translation performance even when only a small amount of non-English text is used as training data (Brown et al., 2020; Shi et al., 2023), which indicates efficient language transfer from English to other languages. Such cross-lingual transferability has been evaluated by holistic measures, such as perplexity and accuracy on downstream tasks (Papadimitriou and Jurafsky, 2020; Deshpande et al., 2022; Blevins et al., 2022). On the other hand, there is much room for investigating it from linguistic perspectives, e.g., grammatical knowledge acquisition and language transfer tendencies among languages.
In this study, we investigate the cross-lingual transferability of LMs from the perspective of second language (L2) acquisition. Our main research question is how first language (L1) acquisition of LMs affects the efficiency of grammar acquisition in L2. To answer this question, we design an experimental procedure (Section 2): (i) pretraining LMs in a certain language (regarded as their L1), (ii) further training them in English as an L2, and (iii) evaluating and analyzing their linguistic generalization in L2. As L1s, we chose four languages with different levels of difficulty in transferring to English: French, German, Russian, and Japanese. The size of the training data is restricted to match a human-like L2 acquisition scenario, which enables a better comparison with human L2 acquisition tendencies and, hopefully, provides insights into L2 acquisition from a computational linguistic perspective.
We begin by exploring the inductive biases of several L2 training methods (Section 3). Specifically, we compared several variations of the L2 data settings, such as training on only the L2 texts or on L1-L2 translation pairs. We observed that, for example, feeding L1-L2 translation pairs into LMs slowed down their L2 grammar acquisition, compared to feeding only L2 monolingual texts every other epoch.
In our main experiments, we conducted exploratory analyses of the effects of L1 training on L2 grammar acquisition (Section 4). We gained three main insights. First, L1 knowledge promotes better linguistic generalization in L2 (Section 4.1). Second, different L1s incur different generalizations in L2 (Section 4.2). More specifically, Japanese and Russian are far behind French and German, which is consistent with the human-defined difficulty levels of language transfer (Chiswick and Miller, 2004). Third, L1 pretraining has different effects on different types of grammar items (Section 4.3). In particular, morphological and syntactic items gain more than semantic and syntax&semantics items.
We further analyzed the process of L2 acquisition (Section 5). We investigated how L2 knowledge acquisition progresses (Section 5.1) and found that it does not progress much until the models have seen the whole dataset many times (e.g., 50-100 times), implying their data inefficiency. Furthermore, we also observed that L1 knowledge degrades during L2 acquisition; this motivates us to balance the source-target linguistic knowledge during language transfer.
Second language acquisition of LMs

Overview: We are interested in how L1 knowledge affects the linguistic generalization of LMs in L2. Figure 1 shows an overview of the experimental procedure. First, in our L1 acquisition simulation, we train LMs on a monolingual corpus of a specific language. Second, in our L2 acquisition simulation, we additionally train the pretrained LMs on a corpus including L2 texts (English). Finally, we evaluate the grammatical judgment ability of the LMs in the L2 (English) using BLiMP (Warstadt et al., 2020).

Language exposure
First and second languages: We used French, German, Russian, and Japanese as L1s and English as the L2 (Table 1). We expect transfer to English to become more difficult in the order of French, German, Russian, and Japanese, from multiple perspectives: linguistic distance (Grimes and Grimes, 2002; Chiswick and Miller, 2004) and learning difficulty level.

L1 acquisition:
We first train LMs on a particular L1 using a monolingual corpus of approximately 100M words sampled from CC-100 (Conneau et al., 2020; Wenzek et al., 2020). The corpus size is roughly similar to the number of words humans are exposed to during language acquisition. We trained the models for 100 epochs.

L2 acquisition:
We then further train the L1 LMs with bilingual input (Section 3). We trained the models for 100 epochs, but the intermediate checkpoints will also be analyzed to track the process of L2 acquisition in Section 5. We used Tatoeba (Tiedemann, 2012) as the parallel corpus in L2 acquisition. Tatoeba is a multilingual parallel corpus consisting of example sentences originally collected for human foreign-language learners. From the L2 acquisition perspective, this amount would be large enough for human learners to learn the top 95% of English words by frequency (Nation, 2014).
Note that there are several scenarios of human L2 learning/acquisition, such as through language classes or natural communication. Following Krashen et al. (1979), we use L2 acquisition to refer to the latter scenario of acquiring an L2 through natural language exposure, e.g., raw texts.

Learners
We largely followed the settings of the cross-lingual language model (XLM) (Conneau and Lample, 2019), which is typically used for cross-lingual language modeling in the field of natural language processing (NLP). In short, this is a Transformer-based bidirectional LM, but the input consists of bilingual text pairs. The tokens in the bilingual text are randomly masked, and the model predicts the masked tokens on both the L1 and L2 sides. During L1 training, only the L1 side is input.
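As a rough illustration of this input format, the following minimal sketch constructs one masked bilingual example in which tokens on both the L1 and L2 sides can become prediction targets. This is our own illustrative code, not the XLM implementation; the function name, the `[MASK]` string, and the `</s>` separator are assumptions.

```python
import random

MASK = "[MASK]"

def make_tlm_example(l1_tokens, l2_tokens, mask_prob=0.15, seed=1):
    """Build one translation-language-modeling (TLM) training example:
    the L1 and L2 sentences are concatenated, and tokens on BOTH sides
    are randomly masked; the original tokens at masked positions become
    the prediction targets."""
    rng = random.Random(seed)
    tokens = l1_tokens + ["</s>"] + l2_tokens  # "</s>" separator is an assumption
    inputs, targets = [], []
    for tok in tokens:
        if tok != "</s>" and rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)    # loss is computed at this position
        else:
            inputs.append(tok)
            targets.append(None)   # no loss at unmasked positions
    return inputs, targets

inp, tgt = make_tlm_example(["le", "chat", "dort"], ["the", "cat", "sleeps"])
```

A real implementation would operate on subword IDs and typically use BERT-style random/keep replacements in addition to `[MASK]`; here, masked positions simply become targets.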
The bilingual XLM is trained from scratch (L1 training and L1-L2 training), rather than using the off-the-shelf pretrained XLM that is trained across dozens of languages (Conneau and Lample, 2019; Conneau et al., 2020). From a cognitive perspective, such a super-multilingual setting is unrealistic, since humans rarely face dozens of languages in a multilingual acquisition scenario. Rather, we hope that such bilingual transfer will gain attention from adjacent areas, such as pedagogy and the cognitive science of human second language acquisition/learning.
Technically, we randomly initialized the parameters of the XLM (18M), constructed a bilingual tokenizer using byte pair encoding on the bilingual texts, and trained the model separately for each L1-L2 combination. For each L1-L2 setting, we trained four models with different seeds; the results reported in Sections 4 and 5 are the averages of the scores from the four models. See Table 5 in the Appendix for the hyperparameters and detailed settings.

Evaluation
Dataset: We used BLiMP (Warstadt et al., 2020), a benchmark of English grammatical judgment tests, to evaluate the models' L2 linguistic generalization. The dataset consists of 12 test suites; each corresponds to a specific linguistic phenomenon and falls into one of four coarse linguistic categories: morphology, syntax, semantics, and syntax&semantics. Each test suite has 1,000 minimal sentence pairs, each consisting of a grammatically acceptable sentence and an unacceptable one, as follows:

(1) a. Many teenagers were helping themselves.
    b. *Many teenagers were helping herself.

Grammatical judgment: To select one sentence in each pair, we adopted pseudo-perplexity, which is commonly used in exploring the linguistic behaviors of LMs (Lau et al., 2020). Specifically, if the model assigns a lower pseudo-perplexity to the grammatical sentence than to the paired ungrammatical one, we regard the judgment as correct. Following Salazar et al. (2020), the pseudo-perplexity (PPPL) of a sentence s = [w_1, w_2, ..., w_n] under the bidirectional LM θ is computed as

PPPL(s) = exp( -(1/n) Σ_{t=1}^{n} log P(w_t | s_{\w_t}; θ) ),

where w_t denotes the t-th token in the sentence, s_{\w_t} denotes all the tokens in the sentence except for w_t, and the probability of w_t given its bidirectional context s_{\w_t} is calculated by the model θ. Based on the selected sentences, we calculated an accuracy score on each test suite of BLiMP. We also report the macro-average of accuracies among all the test suites. Note that all the accuracy scores reported in the tables/figures in this paper are multiplied by 100 for readability.
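The judgment rule above can be sketched as follows. The scoring function here is a toy stand-in for a trained masked LM (a hedged illustration, not the authors' code); in the actual experiments, each token's log-probability comes from the bidirectional XLM with that token masked.

```python
import math

def pseudo_perplexity(tokens, token_logprob):
    """PPPL(s) = exp(-(1/n) * sum_t log P(w_t | s_{\\w_t}; theta)).
    `token_logprob(tokens, t)` stands in for the masked LM's
    log-probability of token t given its bidirectional context."""
    n = len(tokens)
    total = sum(token_logprob(tokens, t) for t in range(n))
    return math.exp(-total / n)

def judge(pair, token_logprob):
    """A minimal pair is judged correctly if the grammatical sentence
    (first element) receives a LOWER pseudo-perplexity."""
    good, bad = pair
    return pseudo_perplexity(good, token_logprob) < pseudo_perplexity(bad, token_logprob)

# Toy scorer that penalizes the ungrammatical reflexive; a real masked LM
# would be queried at each masked position instead.
def toy_logprob(tokens, t):
    return math.log(0.05) if tokens[t] == "herself" else math.log(0.5)

pair = (["many", "teenagers", "were", "helping", "themselves"],
        ["many", "teenagers", "were", "helping", "herself"])
print(judge(pair, toy_logprob))  # True: the grammatical sentence has lower PPPL
```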

Preliminary experiment: L2 exposure configurations
First, we investigate the inductive bias of the L2 training settings. While existing studies use parallel data as input for cross-lingual training (Conneau and Lample, 2019), we investigate the bias of this setting from the perspective of L2 grammar acquisition.
Settings: We set up three training settings with different input data: (i) L1-L2 text pairs without a translation relationship (TLM-nopara), (ii) L1-L2 translation pairs (TLM-para), and (iii) a mixed setting that alternates between using the L2 text concatenated with its parallel L1 text and using only the L2 text as input (TLM-drop). An overview of the experimental settings is shown in Figure 2 (see details in Appendix A). Note that the original XLM (Conneau and Lample, 2019) adopts a setting similar to TLM-drop. In this experiment, we report the macro-average BLiMP accuracies across the test suites. Table 2 shows the results (see Table 7 for the results on fine-grained test suites).
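The three exposure settings can be sketched as follows. This is an illustrative reconstruction from the descriptions above; the exact pairing and scheduling in the actual experiments may differ.

```python
import random

def make_inputs(l1_sents, l2_sents, setting, epoch, seed=0):
    """Sketch of the three L2-exposure settings.  Each returned item is an
    (L1, L2) pair; None means that side is absent from the input.
      TLM-para:   aligned L1-L2 translation pairs
      TLM-nopara: L1 and L2 sentences paired without the translation
                  relation (implemented here by shuffling the L1 side)
      TLM-drop:   translation pairs, but every other epoch the L1 side
                  is dropped and only the L2 text is fed"""
    if setting == "TLM-para":
        return list(zip(l1_sents, l2_sents))
    if setting == "TLM-nopara":
        shuffled = l1_sents[:]
        random.Random(seed + epoch).shuffle(shuffled)
        return list(zip(shuffled, l2_sents))
    if setting == "TLM-drop":
        if epoch % 2 == 0:
            return list(zip(l1_sents, l2_sents))
        return [(None, l2) for l2 in l2_sents]  # L2-only epoch
    raise ValueError(f"unknown setting: {setting}")
```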
Translation pairs do not facilitate L2 acquisition: One notable point in Table 2 is that the results in the TLM-nopara setting were better than those in the TLM-para setting. This suggests that parallel data input does not facilitate L2 acquisition. Perhaps the TLM-para task was too easy for LMs to learn syntactic knowledge from; it could partially be solved by relying solely on lexical knowledge, i.e., capturing the lexical correspondences between the tokens in the L1 and L2 sentences and predicting the words found in only one of them. In this sense, the TLM-nopara setting, by contrast, might impose a more difficult problem on LMs and promote more effective learning of linguistic knowledge.
Switching between using L2 text with and without its parallel L1 text during training facilitates L2 acquisition: Another notable point is that TLM-drop was the most effective setting for LMs to acquire linguistic knowledge in L2.

Experiments: L1→L2 effects on linguistic generalization

We investigate how pretraining with L1 affects L2 grammar acquisition in LMs. We exploratorily compare the linguistic generalization of LMs trained with and without L1 pretraining. Table 3 shows the grammaticality judgment ability after the additional training. The OVERALL column indicates the macro-average accuracy score across the grammar items. The ∆ rows show the difference in BLiMP accuracy between models with and without pretraining. Here, the models without pretraining were trained only on the bilingual corpus, without L1 monolingual pretraining.

L1s promote L2 generalization
Table 3 shows the effect of pretraining with L1 on L2 grammar acquisition. Most of the ∆ values are positive; i.e., the models pretrained with L1s achieved better results than those without pretraining. This demonstrates that pretraining in a particular language generally improves English grammatical ability. This positive effect is in line with the assumption that different languages share some grammatical universals and that learners can exploit such common properties in language transfer (Cook, 1985). Ri and Tsuruoka (2022) also reported that pretraining in a natural language other than English improves overall syntactic parsing performance in English. Our results are consistent with such positive effects. In addition, our experiment provides results for each of the more fine-grained grammatical items related to morphological, syntactic, and semantic phenomena.

Differences in L1s
The ∆ values in the OVERALL column of Table 3 differ across the L1s. French is the highest, followed closely by German, while Japanese and Russian are far behind these two languages; that is, pretraining in French and German is much more effective than in Japanese and Russian. This ordering parallels the presumed language-learning difficulty order: French, German, Russian, and Japanese. This suggests that the difficulty of acquiring L2 grammatical ability is somewhat similar between LMs and humans.

Differences in grammar items
The ∆ scores in Table 3 show that different grammatical items obtain different degrees of gains. Table 4 shows the average ∆ scores for each coarse grammar category. There was a general tendency for morphological and syntactic items to gain more from L1 pretraining than semantic and syntax&semantics items, except in particular settings, e.g., IRREGULAR. It has been shown that linguistic phenomena related to semantics, such as NPI (negative polarity items) and QUANTIFIERS, are relatively difficult for LMs to learn (Warstadt et al., 2020). Based on this, our concern is that the LMs failed to learn such linguistic knowledge well enough in L1 to transfer it to another language.

Differences in L1×grammar-item
Notably, in specific combinations of L1s and grammar items, L1 pretraining hurt L2 generalization, i.e., the negative transfer problem. For example, performance on the IRREGULAR item was barely enhanced, or even degraded, by L1 pretraining. The IRREGULAR (irregular forms) item targets English-specific irregular verb conjugations; the limited effect of L1 pretraining is presumably because such irregular patterns generally cannot be predicted from knowledge of another language. We also found that the same grammatical item was affected differently depending on the L1. For example, the ∆ values on the FILLER-GAP item in Table 3 differ across the L1s, e.g., 4.8 for German and 1.1 for Japanese. At least in this FILLER-GAP aspect, there is an interesting parallel between our results and linguistic notions; Japanese is the only language among our L1s in which the gap precedes the filler in wh-constructions, and the transfer from Japanese to English was indeed limited (∆ = 1.1) compared to the other L1s. Such linguistic (dis)similarities may be reflected in the results. Nevertheless, concluding that our L1×grammar-item results (Table 3) are exactly consistent with L1-L2 grammatical similarity requires further interdisciplinary research.

Analysis: acquisition process
This section sheds light on the process of L2 acquisition. We investigate how L2 knowledge acquisition progresses (Section 5.1) and how the original L1 knowledge changes during L2 acquisition (Section 5.2). As for the L1 knowledge during L2 acquisition, there is a concern, for example, that LMs exhibit catastrophic forgetting of their L1.

General improvement after dozens of epochs:
The trajectories of the OVERALL scores in Figure 3 suggest that linguistic ability generally improves with the number of epochs. Large improvements tended to emerge after dozens of epochs; in other words, the models began to acquire L2 knowledge only after seeing the same examples many times, e.g., 50-100 times. Note that humans are argued to acquire a word after encountering it about 12 times (Nation, 2014). Lexical and syntactic acquisition are, of course, not directly comparable, but the observation that L2 knowledge improves only after 50-100 passes over the corpus may indicate that LMs are inefficient at acquiring a new language.
Differences in grammar items: Focusing on the general trajectory shapes for each grammatical item, we observed at least four patterns: (i) spike-at-the-end (D-N AGR., IRREGULAR, S-V AGR.), (ii) flat (ARG.STR., CTRL.RAIS., ISLAND), (iii) bumpy (ANA.AGR., ELLIPSIS, NPI, QUANTIFIERS), and (iv) mixed (FILLER-GAP, BINDING). These groups roughly mirror the linguistic categories of the grammar items (morphology, syntax, semantics, and syntax&semantics); for example, all the items in the spike-at-the-end group are morphological phenomena, while all the semantic items (NPI, QUANTIFIERS) yielded bumpy patterns. Note that existing studies reported that low-level (e.g., morphological) linguistic skills can be acquired earlier and vice versa (Liu et al., 2021; Blevins et al., 2022); at least in our cognitively inspired bilingual training scenario, however, we did not observe such an explicit tendency.

Inter-L1s differences: In terms of how the inter-L1 differences in accuracy change for each grammar item, there are several patterns: (i) converging (IRREGULAR, ISLAND), (ii) diverging (ARG.STR., BINDING, D-N AGR., S-V AGR.), and (iii) neither (ANA.AGR., CTRL.RAIS., ELLIPSIS, FILLER-GAP, NPI, QUANTIFIERS). Taking IRREGULAR as an example of the converging group, the accuracies were substantially different across the L1s in the initial stage of training; these differences, however, gradually shrank over the epochs. On the other hand, taking S-V AGR. as an example of the diverging group, the accuracies gradually diverged among the LMs in the latter stage of training. In the third group, the inter-L1 accuracy differences remained the same or were unstable during L2 training. At least the former two groups imply that pretraining with different L1s affects the efficiency of L2 acquisition differently (e.g., different slopes for different L1s).
To sum up, we clarified that different L1s and grammar items lead to different learning dynamics in LMs. The cognitive plausibility of these patterns could be the next important investigation.

L2→L1 effects
In contrast to the previous analysis, which examined the impact in the L1→L2 direction, we further analyze the L2→L1 impact. In applied linguistics, the L1-L2 impact in both directions is of interest; for example, it has been reported that L1 ability is sometimes hurt by increased L2 exposure (Kecskes, 2008; Haman et al., 2017).
Settings: In the same way as Section 5.1, we evaluate the grammatical knowledge of LMs during L2 training, but the focus here is on the L1 knowledge. We used CLAMS (Mueller et al., 2020), a multilingual benchmark of grammatical judgment tests. This dataset consists of seven syntactic test suites for several languages. Similarly to BLiMP, the dataset consists of pairs of sentences, where one is grammatical and the other is ungrammatical. We report accuracy scores in terms of whether the LMs assign a lower pseudo-perplexity to the grammatically correct sentence. We analyze the French-L1, German-L1, and Russian-L1 LMs, since the dataset covers these three L1s.
As a baseline, we also evaluate the L1-only LMs that are first pretrained in L1, then additionally trained with the L1-side texts collected from the corresponding parallel corpus; that is, the only difference between bilingual LMs and L1-only baselines was the presence of L2 texts during the L2 acquisition phase.
L2 effects occur at first, but diminish: Figure 4 shows the results (see Table 8 for the exact scores). We found that the L1 knowledge is greatly influenced by the L2, especially at the initial stage of L2 training. For example, the French-L1 and German-L1 LMs gained positive effects, while the Russian-L1 LM was negatively affected (the top row of Figure 4). In the latter stage, the L2 effects on the L1 knowledge gradually fade, for better or worse, and the L1 linguistic knowledge converges back toward its original level. For example, the French-L1 and German-L1 LMs exhibited better performance early in bilingual training, e.g., at epoch 10, but the gain decreased after further bilingual training.
L2 negatively affects L1 knowledge: In Figure 4, the L1 knowledge in the bilingual LMs was ultimately competitive with, or even inferior to, that in the L1-only LMs, although there were exceptional cases for specific grammar items in German. For example, in the French-L1 LMs, the L1 syntactic generalization performance after L2 training converged below 0.7 points, while the L1-only baseline generally achieved above 0.7 points. Contrasting the L1 results with the L2 ones (Section 5.1), there is an asymmetric effect between L1 and L2. From the perspective of developing linguistically better multilingual LMs, these results suggest that balancing the linguistic knowledge of the languages, especially maintaining the L1 knowledge during language transfer, is challenging even when the model is exposed to L1 during bilingual training. Addressing these challenges with, for example, some form of regularization will be a promising direction from an engineering perspective.

Related work
Cognitively motivated analysis of neural LMs: Investigations into the ability of neural models in language acquisition began in the 1980s with interest in whether language can be acquired without innate knowledge (Rumelhart and McClelland, 1986; Pinker and Prince, 1988). The initial investigations used simple neural networks; after the development of neural NLP (Manning, 2015), the classical questions posed by cognitive science are currently being revisited (Kirov and Cotterell, 2018). A typical movement is the growing interest in probing neural LMs' linguistic knowledge (Linzen et al., 2016; Warstadt and Bowman, 2020). In this context, our study analyzes L2 acquisition in neural LMs, while existing studies have typically focused on L1 acquisition.
Language transfer in computational models: Language transfer of NLP models has been actively researched from both engineering and scientific perspectives. In the engineering context, to mitigate the English-centric focus of NLP techniques, models that can handle more languages have been developed (Dong et al., 2015; Conneau and Lample, 2019; Conneau et al., 2020). From the scientific perspective, the mechanisms and linguistic properties of LMs' language transfer have been explored (Pérez-Mayos et al., 2021; Tyler et al., 2022; Blevins et al., 2022), sometimes beyond transfer between natural languages (Ri and Tsuruoka, 2022; Papadimitriou and Jurafsky, 2020). One motivation for such analyses is to quantify the transferable universals behind (non-)languages. Notably, simulation of L2 acquisition has also been explored with pedagogical motivations (Settles et al., 2018).
Language transfer in humans: L2 acquisition/learning has long been studied in applied linguistics, psycholinguistics, and pedagogy (Krashen, 1981; Hatch, 1983; Ellis, 2010). These fields have articulated several hypotheses/theories on human language learning, e.g., the input hypothesis (Krashen, 1977). Analyzing LMs' L2 acquisition in more direct light of these hypotheses would be interesting future work.

Conclusions
We have investigated the L2 acquisition of LMs, especially focusing on their grammatical knowledge in L1 and L2. Specifically, we trained bilingual LMs under a scenario similar to human L2 acquisition and then analyzed their cross-lingual transfer. Our experiments demonstrated that L1 pretraining promotes linguistic generalization in L2 and that there are interesting variations in the L1 pretraining effects with respect to the L1 choice, training settings, and grammar items. The results also implied that LMs' L2 acquisition is not human-like in particular aspects.

Limitations

Coverage of experiments
There is room for further exploration in terms of model architectures, data, and evaluation settings in our study.Experiments in more diverse settings would enhance the generality of the conclusions.

Models:
The architecture was fixed to XLM (14M), although we tested four models with different seeds in each setting. Testing unidirectional LMs would be a reasonable direction, considering humans' incremental language processing. Relatedly, the pseudo-perplexity measure might also induce unintended biases in our results, though it is a common metric in NLP. In addition, there are different methods for fine-tuning a model on multiple languages, e.g., using adapters. Comparing these methods with our scheme would be an interesting direction.
Data: There are many possible L1-L2 combinations. Although we selected the L1 from four languages (German, French, Russian, and Japanese) and fixed the L2 as English, increasing the coverage of languages would lead to more general conclusions. Furthermore, the performance of the LMs was generally not very good on the BLiMP dataset. We suspect that this is due to the limited L2 training data size; it is also worth scaling the experiments up to typical NLP settings.
Evaluation: While our focus is on morphological, syntactic, and semantic generalization, L2 acquisition studies are conducted from broader perspectives, such as the growth of vocabulary size. In addition, our observations are from the perspective of the LMs; contrasting LMs' and humans' L2 learning is even more important from an interdisciplinary perspective.

Performance was overall poor
One reviewer was concerned that our results on BLiMP are generally near the chance level and that it may be difficult to derive findings from such poor results. We thank the reviewer and would like to share our thoughts and limitations here.
First, comparisons to the chance rate are not always meaningful. In BLiMP, the task is typically to select a correct generalization over an incorrect one. Occasionally, neural models prefer incorrect generalizations at rates far above chance. For instance, in the linear vs. hierarchical generalization contrast, neural models often favor the linear generalization, causing accuracy to drop near 0, far below chance (McCoy et al., 2018). In such cases, achieving accuracies around 50 indicates more than random guessing, as the models avoid an excessive preference for incorrect generalizations, moving toward a more neutral stance. Thus, we believe that it is also worthwhile to observe how much performance improves from below the chance level.
Furthermore, in BLiMP's finer-grained test suites, our models sometimes exhibit an accuracy of 0 or 100 (resulting in an overall score of around 50.0), highlighting that our models do not always act as random-guessing baselines.

A Experimental Procedure
We list the hyperparameters in Table 5. The versions and licenses of the tools and datasets used are listed in Table 6.
L1 Acquisition: We used mosesdecoder (Koehn et al., 2007) as the French, German, and Russian tokenizer and kytea (Neubig et al., 2011; Neubig and Mori, 2010) as the Japanese tokenizer, and segmented words into subwords with fastBPE (Sennrich et al., 2016). The dataset was split into train/dev/test at a ratio of 8:1:1. We set the vocabulary size to 14,000 for every language. We trained our models with 4 parallel GPUs (VRAM 48G), which took 6 days per model.

L2 Acquisition:
We added English tokens from the parallel corpus to the BPE codes and vocabulary used in L1 acquisition and removed duplicated tokens and vocabulary entries. For the models not using a monolingual corpus, we created the codes and vocabulary from the parallel corpus of both L1 and L2. We used mosesdecoder (Koehn et al., 2007) for tokenization. As we increased the number of vocabulary entries in the embedding layer, the weights/biases in the final layer were also expanded accordingly. Our LMs were trained with different seeds, and we report their averages as results. The models compared in our preliminary experiment (Section 3) are shown in Figure 2. We trained our models with 2-4 GPUs (VRAM 48G), which took around 5 hours per model.
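The vocabulary/embedding extension described above can be sketched as follows. This is a simplified, illustrative stand-in (the function name and initialization are assumptions, not the authors' code); a real implementation would resize the tied output layer in the same way.

```python
import random

def extend_embeddings(emb, vocab, new_tokens, dim, seed=0):
    """Grow the embedding table when L2 tokens are added to the L1
    vocabulary: existing rows are kept as-is, tokens already shared with
    the L1 vocabulary are skipped (duplicates removed), and each
    genuinely new token gets a freshly initialized row."""
    rng = random.Random(seed)
    vocab = dict(vocab)                  # token -> row index
    emb = [row[:] for row in emb]        # copy existing rows unchanged
    for tok in new_tokens:
        if tok in vocab:                 # duplicate: already in the L1 vocabulary
            continue
        vocab[tok] = len(emb)
        emb.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return emb, vocab
```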

Figure 1: Experimental procedure. First, we pretrain the monolingual masked language model on the first language (first language acquisition; L1 acquisition). Then, the model is additionally trained under the bilingual setting (second language acquisition; L2 acquisition). Finally, we analyze the effect of L1 on L2 via a grammatical judgment test in L2.

Figure 3: Grammar learning trajectories in each test suite on BLiMP (the L2 side). The x-axis denotes the epoch during L2 acquisition, and the y-axis denotes the accuracy on the corresponding test suite.

Figure 4: Grammar learning trajectories in each test suite on CLAMS (the L1 side). The x-axis denotes the epoch during L2 acquisition, and the y-axis denotes the accuracy on the corresponding test suite. The upper parts show the scores of the bilingual LMs (L1 pretraining and bilingual training). The lower parts show the scores of the L1-only LMs (L1 pretraining and further training on L1 texts collected from the parallel corpus).

Table 2: Performance of bilingual LMs on BLiMP in different training settings. The para. column indicates whether a parallel corpus was used. The drop column indicates whether the L1-side input is removed every other epoch. In addition, this setting might mitigate the training-inference mismatch in evaluating LMs' linguistic knowledge with BLiMP: there, a single sentence is used as input, which is compatible with the phases of using only L2 text during TLM-drop training. In subsequent experiments, we use the TLM-drop setting, as it was the most effective training setting for L2 grammar acquisition.

Table 3: English (L2) grammatical knowledge of bilingual LMs with different L1s. OVERALL indicates the macro-average accuracy over all linguistic phenomena. The rows with ✓ in the L1 column show the accuracy of the bilingual LMs on BLiMP. The rows with ∆ in the L1 column show the performance difference between the LMs with and without L1 pretraining. The coarse categories (e.g., Morphology) are from the metadata of BLiMP.

Table 5: Hyperparameters of the LMs.

Table 6: The versions and licenses of the tools and datasets used. The tools and datasets used in this study were designed for the purposes of research and language learning.

Table 7: Results for each fine-grained test suite in BLiMP.

Table 8: Results for each fine-grained test suite in CLAMS.