Uncovering Constraint-Based Behavior in Neural Models via Targeted Fine-Tuning

A growing body of literature has focused on detailing the linguistic knowledge embedded in large, pretrained language models. Existing work has shown that non-linguistic biases in models can drive model behavior away from linguistic generalizations. We hypothesized that competing linguistic processes within a language, rather than just non-linguistic model biases, could obscure underlying linguistic knowledge. We tested this claim by exploring a single phenomenon in four languages: English, Chinese, Spanish, and Italian. While human behavior has been found to be similar across languages, we find cross-linguistic variation in model behavior. We show that competing processes in a language act as constraints on model behavior and demonstrate that targeted fine-tuning can re-weight the learned constraints, uncovering otherwise dormant linguistic knowledge in models. Our results suggest that models need to learn both the linguistic constraints in a language and their relative ranking, with mismatches in either producing non-human-like behavior.


Introduction
Ever larger pretrained language models continue to demonstrate success on a variety of NLP benchmarks (e.g., Devlin et al., 2019; Brown et al., 2020). One common approach to understanding why these models are successful centers on inferring what linguistic knowledge such models acquire (e.g., Linzen et al., 2016; Hewitt and Manning, 2019; Hu et al., 2020; Warstadt et al., 2020a). Linguistic knowledge alone, of course, does not fully account for model behavior; non-linguistic heuristics have also been shown to drive model behavior (e.g., sentence length; see McCoy et al., 2019; Warstadt et al., 2020b). Nevertheless, when looking across a variety of experimental methods, models appear to acquire some grammatical knowledge (see Warstadt et al., 2019).
However, investigations of linguistic knowledge in language models are limited by the overwhelming prominence of work solely on English (though see Gulordava et al., 2018; Ravfogel et al., 2018; Mueller et al., 2020). Prior work has shown that non-linguistic biases of neural language models mimic English-like linguistic structure, limiting the generalizability of claims founded on English data (e.g., Dyer et al., 2019; Davis and van Schijndel, 2020b). In the present study, we show, via cross-linguistic comparison, that knowledge of competing linguistic constraints can obscure underlying linguistic knowledge.
Our investigation is centered on a single discourse phenomenon, implicit causality (IC) verbs, in four languages: English, Chinese, Spanish, and Italian. When an IC verb occurs in a sentence, interpretations of pronouns are affected:

(1) a. Lavender frightened Kate because she was so terrifying.
    b. Lavender admired Kate because she was so amazing.
In (1), both Lavender and Kate agree in gender with she, so both are possible antecedents. However, English speakers overwhelmingly interpret she as referring to Lavender in (1-a) and Kate in (1-b). Verbs that have a subject preference (e.g., frightened) are called subject-biased IC verbs, and verbs with an object preference (e.g., admired) are called object-biased IC verbs. IC has been a rich source of psycholinguistic investigation (e.g., Garvey and Caramazza, 1974; Hartshorne, 2014; Williams, 2020). Current accounts of IC ground the phenomenon within the linguistic signal, without the need for additional pragmatic inferences by comprehenders (e.g., Rohde et al., 2011; Hartshorne et al., 2013). Recent investigations of IC in neural language models confirm that the IC bias of English is learnable, at least to some degree, from text data alone (Davis and van Schijndel, 2020a; Upadhye et al., 2020). The ability of models trained on other languages to acquire an IC bias, however, has not been explored. Within the psycholinguistic literature, IC has been shown to be remarkably consistent cross-linguistically (see Hartshorne et al., 2013; Ngo and Kaiser, 2020). That is, IC verbs have been attested in a variety of languages. Given the cross-linguistic consistency of IC, then, models trained on other languages should also demonstrate an IC bias. However, using two popular model types, BERT based (Devlin et al., 2019) and RoBERTa based (Liu et al., 2019),1 we find that models acquired a human-like IC bias in English and Chinese but not in Spanish and Italian.
We relate this to a crucial difference in the presence of a competing linguistic constraint affecting pronouns in the target languages. Namely, Spanish and Italian have a well-studied process called pro drop, which allows for subjects to be 'empty' (Rizzi, 1986). An English equivalent would be "(she) likes BERT", where she can be elided. While IC verbs increase the probability of a pronoun that refers to a particular antecedent, pro drop disprefers any overt pronoun in subject position (i.e. the target location in our study). That is, the two processes are in direct competition in our experiments. As a result, Spanish and Italian models are susceptible to overgeneralizing any learned pro-drop knowledge, favoring no pronouns rather than IC-conditioned pronoun generation.
To exhibit an IC bias, models of Spanish and Italian have two tasks: learn the relevant constraints (i.e. IC and pro drop) and learn the relative ranking of these constraints. We find that the models learn both constraints but, critically, instantiate the wrong ranking, favoring pro drop over an IC bias. Using fine-tuning to demote pro drop, we are able to uncover otherwise dormant IC knowledge in Spanish and Italian. Thus, the apparent failure of the Spanish and Italian models to pattern like English and Chinese is not evidence on its own of a model's inability to acquire the requisite linguistic knowledge, but is in fact evidence that models are unable to adjudicate between competing linguistic constraints in a human-like way. In English and Chinese, the promotion of a pro-drop process via fine-tuning has the opposing effect, diminishing the IC bias in model behavior. As such, our results indicate that non-human-like behavior can be driven by failure either to learn the underlying linguistic constraints or to learn the relevant constraint ranking.

1 These model types were chosen for ease of access to existing models. Pretrained, large auto-regressive models are largely restricted to English, and prior work suggests that LSTMs are limited in their ability to acquire an IC bias in English (Davis and van Schijndel, 2020a).

Related Work
This work is intimately related to the growing body of literature investigating linguistic knowledge in large, pretrained models. Largely, this literature articulates model knowledge via isolated linguistic phenomena, such as subject-verb agreement (e.g., Linzen et al., 2016; Mueller et al., 2020), negative polarity items (e.g., Marvin and Linzen, 2018; Warstadt et al., 2019), and discourse and pragmatic structure (including implicit causality; e.g., Ettinger, 2020; Schuster et al., 2020; Jeretic et al., 2020; Upadhye et al., 2020). Our study differs primarily in framing model linguistic knowledge as sets of competing constraints, which privileges the interaction between linguistic phenomena.
Prior work has noted competing generalizations influencing model behavior via the distinction of non-linguistic vs. linguistic biases (e.g., McCoy et al., 2019; Davis and van Schijndel, 2020a; Warstadt et al., 2020b). The findings in Warstadt et al. (2020b), that linguistic knowledge is represented within a model much earlier than it is attested in model behavior, bear resemblance to our claims. We find that linguistic knowledge can, in fact, lie dormant due to other linguistic processes in a language, not just due to non-linguistic preferences. Our findings suggest that some linguistic knowledge may never surface in model behavior, though further work is needed on this point.
In the construction of our experiments, we were inspired by synthetic language studies which probe the underlying linguistic capabilities of language models (e.g., McCoy et al., 2018; Ravfogel et al., 2019). We made use of synthetically modified language data that accentuated, or weakened, evidence for certain linguistic processes. The goal of such modification in our work is quite similar both to work which attempts to remove targeted linguistic knowledge in model representations (e.g., Ravfogel et al., 2020; Elazar et al., 2021) and to work which investigates the representational space of models via priming (Prasad et al., 2019; Misra et al., 2020).
In the present study, rather than identifying isolated linguistic knowledge or using priming to study relations between underlying linguistic representations, we ask how linguistic representations interact to drive model behavior.

Models
Prior work on IC in neural language models has been restricted to autoregressive models for ease of comparison to human results (e.g., Upadhye et al., 2020). In the present study, we focused on two popular non-autoregressive language model variants, BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). We used existing models available via HuggingFace (Wolf et al., 2020). Multilingual models have been claimed to perform worse on targeted linguistic tasks than monolingual models (e.g., Mueller et al., 2020). We confirmed this claim by evaluating mBERT, which exhibited no IC bias in any language.2 Thus, we focus in the rest of this paper on monolingual models (summarized in Table 1). For English, we used the BERT base uncased model and the RoBERTa base model. For Chinese, we evaluated BERT and RoBERTa models from Cui et al. (2020). For Spanish, we used BETO (Cañete et al., 2020) and RuPERTa (Romero, 2020). For Italian, we evaluated an uncased Italian BERT3 as well as two RoBERTa based models, UmBERTo (Parisi et al., 2020) and GilBERTo (Ravasio and Di Perna, 2020).


Stimuli

Each verb in the human experiments was coded for IC bias based on continuations of sentence fragments (e.g., Kate accused Bill because ...). For Spanish, we used the IC verbs from Goikoetxea et al. (2008), which followed a similar paradigm to Ferstl et al. (2011) for English. Participants were given sentence fragments and asked to complete the sentence and circle their intended referent. The study reported the percent of subject continuations for 100 verbs, from which we used the 61 verbs that had a significant IC bias (i.e. excluding verbs with no significant subject or object bias).
For Italian, we used the 40 IC verbs reported in Mannetti and De Grada (1991). Human participants were given ambiguous completed sentences with no overt pronoun, like "John feared Michael because of the kind of person (he) is", and were asked to judge who the null pronoun referred to, with the average number of responses that gave the subject as the antecedent reported.5 For Chinese, we used 59 IC verbs reported in Hartshorne et al. (2013), which determined average subject bias per verb in a similar way to Mannetti and De Grada (1991) (i.e. judgments of antecedent preferences given ambiguous sentences, this time with overt pronouns).6

We generated stimuli using 14 pairs of stereotypical male and female nouns (e.g., man vs. woman, husband vs. wife) in each language, rather than rely on proper names as was done in the human experiments.4 The models we investigated are bidirectional, so we used a neutral right context, was there, for English and Spanish, where the human experiments provided no right context.7 For Italian, we utilized the full sentences investigated in the human experiments.

4 All stimuli, as well as code for reproducing the results of the paper, are available at https://github.com/forrestdavis/ImplicitCausality. For each language investigated, the stimuli were evaluated for grammaticality by native speakers with academic training in linguistics.

5 Specifically, Mannetti and De Grada (1991) grouped the verbs into four categories and reported the average per category as well as individual verb results for the most biased verbs and the negative/positive valency verbs. Additionally, figures showing average responses across various conditions were reported for one of the categories. From the combination of this information, the average scores for all but two verbs could be determined. The remaining two verbs were assigned the reported average score of their stimuli group.
The Chinese human experiment also used full sentences, but relied on nonce words (i.e. novel, constructed words like sliktopoz), so we chose instead to generate sentences like the English and Spanish ones. All stimuli had subjects and objects that differed in gender, such that each noun occurred in both subject and object position (i.e. the stimuli were fully balanced for gender):

(2) the man admired the woman because [MASK] was there.8

The mismatch in gender forced the choice of pronoun to be unambiguous. For each stimulus, we gathered the scores assigned to the third person singular male and female pronouns (e.g., he and she).9 Our measures were grouped by antecedent type (i.e. the pronoun refers to the subject or the object) and whether the verb was object-biased or subject-biased. For example, BERT assigns to (2) a score of 0.01 for the subject antecedent (i.e. he) and 0.97 for the object (i.e. she), in line with the object-bias of admire.
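The scoring step can be sketched as follows. The vocabulary and logits below are toy stand-ins for a real masked language model's output at the [MASK] position; only the softmax-and-extract logic mirrors the procedure described above:

```python
import math

def pronoun_scores(logits, vocab, pronouns=("he", "she")):
    """Softmax the masked-position logits and return each pronoun's score."""
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    probs = {w: e / total for w, e in zip(vocab, exps)}
    return {p: probs[p] for p in pronouns}

# Hypothetical logits at [MASK] in:
#   "the man admired the woman because [MASK] was there."
# The subject is male and the object female, and "admired" is
# object-biased, so "she" (object antecedent) should score higher.
vocab = ["he", "she", "it", "they", "was"]
logits = [0.2, 3.5, -1.0, 0.0, -2.0]  # illustrative values only

scores = pronoun_scores(logits, vocab)
```

In the actual experiments these probabilities would come from a pretrained model's prediction head; grouping the resulting `he`/`she` scores by verb bias and antecedent gives the measures reported in the results sections.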

Models Inconsistently Capture Implicit Causality
As exemplified in (1), repeated below as (3), IC verb bias modulates the preference for pronouns.

(3) a. Lavender frightened Kate because she was so terrifying.
    b. Lavender admired Kate because she was so amazing.
An object-biased IC verb (e.g., admired) should increase the likelihood of pronouns that refer to the object, and a subject-biased IC verb (e.g., frightened) should increase the likelihood of reference to the subject. Given that all the investigated stimuli were disambiguated by gender, we categorized our results by the antecedent of the pronoun and the IC verb bias. We first turn to English and Chinese, which showed an IC bias in line with existing work on IC bias in autoregressive English models (e.g., Upadhye et al., 2020; Davis and van Schijndel, 2020a). We then detail the results for Spanish and Italian, where only very limited, if any, IC bias was observed.

English and Chinese
The results for English and Chinese are given in Figure 1 and detailed in Appendix B. All models demonstrated a greater preference for pronouns referring to the object after an object-biased IC verb than after a subject-biased IC verb.10 Additionally, they had greater preferences for pronouns referring to the subject after a subject-biased IC verb than after an object-biased IC verb. That is, all models showed the expected IC-bias effect. Generally, there was an overall greater preference for referring to the object, in line with a recency bias, with the exception of RoBERTa, where subject-biased IC verbs neutralized the recency effect.

Spanish and Italian
The results for Spanish and Italian are given in Figure 2 and detailed in Appendix B. In stark contrast to the models of English and Chinese, an IC bias was either not demonstrated or only weakly attested. For Spanish, BETO showed a greater preference for pronouns referencing the object after an object-biased IC verb than after a subject-biased IC verb. There was no corresponding IC effect for pronouns referring to the subject, and RuPERTa (a RoBERTa based model) had no IC effect at all. Italian BERT and GilBERTo (a RoBERTa based model) had no significant effect of IC verb on pronouns referring to the object. There was a significant, albeit very small, increased score for pronouns referring to the subject after a subject-biased IC verb, in line with a weak subject IC bias. Similarly, UmBERTo (a RoBERTa based model) had significant, yet very small, IC effects, where object-biased IC verbs increased the score of pronouns referring to objects compared to subject-biased IC verbs (and conversely for pronouns referring to the subject).
Any significant effects in Spanish and Italian were much smaller than their counterparts in English (as is visually apparent between Figure 1 and Figure 2), and each of the Spanish and Italian models failed to demonstrate at least one of the IC effects.

Pro Drop and Implicit Causality: Competing Constraints
We were left with an apparent mismatch between models of English and Chinese and models of Spanish and Italian. In the former, an IC verb bias modulated pronoun preferences. In the latter, the same IC verb bias was comparably absent. Recall that, for humans, the psycholinguistic literature suggests that IC bias is, in fact, quite consistent across languages (see Hartshorne et al., 2013). We found a possible reason why the two sets of models behave so differently by carefully considering the languages under investigation. Languages can be thought of as systems of competing linguistic constraints (e.g., Optimality Theory; Prince and Smolensky, 2004). Spanish and Italian exhibit pro drop, and typical grammatical sentences often lack overt pronouns in subject position, opting instead to rely on rich agreement systems to disambiguate the intended subject at the verb (Rizzi, 1986). This constraint competes with IC, which favors pronouns that refer to either the subject or the object. Chinese also allows for empty arguments (both subjects and objects), typically called discourse pro-drop (Huang, 1984).11 As the name suggests, however, this process is more discourse-constrained than the process in Spanish and Italian. For example, in Chinese, an empty subject can only refer to the subject of the preceding sentence (see Liu, 2014). As a means of comparison, in surveying three Universal Dependencies datasets,12 8% of nsubj (or nsubj:pass) relations were pronouns for Chinese, while only 2% and 3% were pronouns in Spanish and Italian, respectively. English lies on the opposite end of the continuum, requiring overt pronouns in the absence of other nominals (cf. He likes NLP and *Likes NLP). Therefore, it is possible that the presence of competing constraints in Spanish and Italian obscured the underlying IC knowledge: one constraint preferring pronouns which referred to the subject or object and the other constraint penalizing overt pronouns in subject position (i.e. the target position masked in our experiments). In the following sections, we removed or otherwise demoted the dominance of each model's pro-drop constraint for Spanish and Italian, and introduced or promoted a pro-drop-like constraint in English and Chinese. We found that the degree of IC bias in model behavior could be controlled by the presence, or absence, of a competing pro-drop constraint.
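The treebank survey above can be computed with a short script over CoNLL-U files; the two-sentence fragment below is a constructed stand-in for a real Universal Dependencies corpus, not actual treebank data:

```python
def pronoun_subject_rate(conllu_text):
    """Fraction of nsubj/nsubj:pass dependents that are pronouns (UPOS PRON)."""
    nsubj = pron = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) < 8 or not cols[0].isdigit():
            continue  # skip multiword-token and empty-node lines
        upos, deprel = cols[3], cols[7]
        if deprel in ("nsubj", "nsubj:pass"):
            nsubj += 1
            if upos == "PRON":
                pron += 1
    return pron / nsubj if nsubj else 0.0

# Constructed example: one pronominal subject, one nominal subject.
sample = """\
1\tElla\tella\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tcanta\tcantar\tVERB\t_\t_\t0\troot\t_\t_

1\tMaria\tMaria\tPROPN\t_\t_\t2\tnsubj\t_\t_
2\tcanta\tcantar\tVERB\t_\t_\t0\troot\t_\t_
"""
rate = pronoun_subject_rate(sample)  # 1 of 2 subjects is a pronoun
```

Run over full Spanish, Italian, and Chinese treebanks, this kind of count yields the pronoun rates reported above.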

Methodology
We constructed two classes of dataset to fine-tune the models on. The first aimed to demote the pro-drop constraint in the Spanish and Italian models; the second aimed to introduce a pro-drop-like constraint into the English and Chinese models. The fine-tuning data was drawn from Universal Dependencies treebanks: for English, we used the English Web Treebank (Silveira et al., 2014), and for Chinese, we used the Traditional Chinese Universal Dependencies Treebank annotated by Google (GSD) and the Chinese Parallel Universal Dependencies (PUD) corpus from the 2017 CoNLL shared task (Zeman et al., 2017).
For demoting pro drop, we found finite (i.e. inflected) verbs that did not have a subject relation in the corpora.13 We then added a pronoun, matching the person and number information given on the verb, alternating the gender. For Italian, this amounted to a dataset of 3,798 sentences with a total of 4,608 pronouns (2,284 he or she) added. For parity with Italian, we restricted Spanish to a dataset of the first 4,000 sentences, which had 5,559 pronouns (3,573 he or she) added. For the addition of a pro-drop constraint in English and Chinese, we found and removed pronouns that bore a subject relation to a verb. This amounted to 935 modified sentences and 1,083 removed pronouns (774 he or she) in Chinese and 4,000 modified sentences and 5,984 removed pronouns (2,188 he or she) in English.14 For each language, 500 unmodified sentences were used for validation, and unchanged versions of all the sentences were kept and used to fine-tune the models as a baseline to ensure that there was nothing about the data themselves that changed the IC bias of the models. Moreover, the fine-tuning data was filtered to ensure that no verbs evaluated in our test data were included. Fine-tuning proceeded using HuggingFace's API. Each model was fine-tuned with a masked language modeling objective for 3 epochs with a learning rate of 5e-5, following the fine-tuning details in Devlin et al. (2019).15
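The promotion direction (removing subject pronouns, as for English and Chinese) can be sketched as below. This is a minimal illustration, assuming sentences have already been extracted from a dependency treebank as (form, upos, deprel) triples; it is not the authors' released code:

```python
def drop_subject_pronouns(sentence):
    """Remove tokens that are pronouns in a subject relation,
    mimicking a Spanish/Italian-style pro-drop process.

    `sentence` is a list of (form, upos, deprel) triples."""
    return [
        (form, upos, deprel)
        for form, upos, deprel in sentence
        if not (upos == "PRON" and deprel in ("nsubj", "nsubj:pass"))
    ]

# "She likes NLP" -> "likes NLP": the overt subject pronoun is removed.
sentence = [("She", "PRON", "nsubj"),
            ("likes", "VERB", "root"),
            ("NLP", "PROPN", "obj")]
modified = drop_subject_pronouns(sentence)
text = " ".join(form for form, _, _ in modified)
```

The demotion direction is the inverse operation: finding finite verbs with no subject dependent and inserting a pronoun agreeing with the verb's person and number features.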

Demoting Pro Drop: Spanish and Italian
As a baseline, we fine-tuned the Spanish and Italian models on unmodified versions of all the data we used for demoting pro drop. The baseline results are given in Figure 3. We found the same qualitative effects detailed in Section 5.2, confirming that the fine-tuning data on its own did not produce model behavior in line with an IC bias.
We turn now to our main experimental manipulation: fine-tuning the Spanish and Italian models on sentences that exhibit the opposite of a pro-drop effect. It is worth repeating that the fine-tuning data shared no verbs or sentence frames with our test data. The results are given in Figure 4. Strikingly, an object-biased IC effect (pronouns referring to the object were more likely after object-biased IC verbs than after subject-biased IC verbs) was observed for Italian BERT and GilBERTo despite no such effect being observed in the base models. Moreover, both models showed a more than doubled subject-biased IC verb effect. UmBERTo also showed increased IC effects compared to the base models. Similarly for Spanish, a subject-biased IC verb effect materialized for BETO when no corresponding effect was observed with the base model. The object-biased IC verb effect remained similar to what was reported in Section 5.2. For RuPERTa, which showed no IC knowledge in the initial investigation, no IC knowledge surfaced after fine-tuning. We conclude that RuPERTa has no underlying knowledge of IC, though further work should investigate this claim.

14 A fuller breakdown of the fine-tuning data is given in Appendix A, with the full training and evaluation data given on our GitHub. We restricted English to the first 4,000 sentences for parity with Italian/Spanish; using the full set of sentences resulted in qualitatively the same pattern. We used the maximum number of sentences we could take from the Chinese UD corpora.
Taken together these results indicate that simply fine-tuning on a small number of sentences can rerank the linguistic constraints influencing model behavior and uncover other linguistic knowledge (in our case an underlying IC-bias). That is, model behavior can hide linguistic knowledge not just because of non-linguistic heuristics, but also due to over-zealously learning one isolated aspect of linguistic structure at the expense of another.

Promoting Pro Drop: English and Chinese
Next, we fine-tuned a pro-drop constraint into the models of English and Chinese. Recall that both sets of models showed an IC effect for both object-biased and subject-biased IC verbs. Moreover, both languages lack the pro-drop process found in Spanish and Italian (though Chinese allows null arguments). As with Spanish and Italian, we fine-tuned the English and Chinese models on unmodified versions of the training sentences as a baseline (i.e. the sentences kept their pronouns), with the results given in Figure 5. There was no qualitative difference from the IC effects noted in Section 5.1. That is, for both English and Chinese, pronouns referring to the object were more likely after object-biased IC verbs than after subject-biased IC verbs, and conversely pronouns referring to the subject were more likely after subject-biased than object-biased IC verbs.
The results after fine-tuning the models on data mimicking a Spanish- and Italian-like pro-drop process (i.e. no pronouns in subject position) are given in Figure 6 and detailed in Appendix B. Despite fine-tuning on only 0.0004% and 0.003% of the data RoBERTa and BERT were trained on, respectively, the IC effects observed in Section 5.1 were severely diminished in English. However, the subject-biased IC verb effect remained robust in both models. For Chinese BERT, the subject-biased IC verb effect in the base model was lost and the object-biased IC verb effect was reduced. The subject-biased IC verb effect was similarly attenuated in Chinese RoBERTa. However, the object-biased IC verb effect remained.
For both languages, exposure to relatively little pro-drop data weakened the IC effect in behavior and even removed it in the case of subject-biased IC verbs in Chinese BERT. This result strengthens our claim that competition between learned linguistic constraints can obscure underlying linguistic knowledge in model behavior.

Discussion
The present study investigated the ability of RoBERTa and BERT models to demonstrate knowledge of implicit causality across four languages (recall the contrast between Lavender frightened Kate and Lavender admired Kate in (1)). Contrary to humans, who show consistent subject- and object-biased IC verb preferences across languages (see Hartshorne et al., 2013), BERT and RoBERTa models of Spanish and Italian failed to demonstrate the full IC bias found in English and Chinese BERT and RoBERTa models (with our English results supporting prior work on IC bias in neural models and extending it to non-autoregressive models; Upadhye et al., 2020; Davis and van Schijndel, 2020a). Following standard behavioral probing (e.g., Linzen et al., 2016), this mismatch could be interpreted as evidence of differences in linguistic knowledge across languages. That is, model behavior in Spanish and Italian was inconsistent with predictions from the psycholinguistic IC literature, suggesting that these models lack knowledge of implicit causality. However, we found that to be an incorrect inference; the models did have underlying knowledge of IC.
Other linguistic processes influence pronouns in Spanish and Italian, and we showed that competition between multiple distinct constraints affects model behavior. One constraint (pro drop) decreases the probability of overt pronouns in subject position, while the other (IC) increases the probability of pronouns that refer to particular antecedents (subject-biased verbs like frightened favoring subjects and object-biased verbs like admired favoring objects). Models of Spanish and Italian, then, must learn not only these two constraints, but also their ranking (i.e. should the model generate a pronoun as IC dictates, or generate no pronoun in line with pro drop). By fine-tuning the models on data contrary to pro drop (i.e. with overt pronouns in subject position), we uncovered otherwise hidden IC knowledge. Moreover, we found that fine-tuning a pro-drop constraint into English and Chinese greatly diminished IC's influence on model behavior (with as little as 0.0004% of a model's original training data).
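The competition just described can be made concrete with a toy weighted-constraint model, in the spirit of Optimality Theory. The constraint names, violation profiles, and weights below are purely illustrative, not fitted to any model:

```python
def best_candidate(candidates, violations, weights):
    """Pick the candidate with the lowest weighted violation cost."""
    def cost(c):
        return sum(weights[k] * v for k, v in violations[c].items())
    return min(candidates, key=cost)

candidates = ["overt pronoun", "null subject"]
# An overt pronoun violates PRO-DROP; a null subject violates the
# IC preference for a pronoun referring to a particular antecedent.
violations = {
    "overt pronoun": {"PRO-DROP": 1, "IC": 0},
    "null subject":  {"PRO-DROP": 0, "IC": 1},
}

# A Spanish/Italian-like ranking: pro drop outranks IC, so the
# preferred output omits the pronoun.
before = best_candidate(candidates, violations, {"PRO-DROP": 2.0, "IC": 1.0})

# After fine-tuning demotes pro drop, IC dominates and the overt,
# IC-conditioned pronoun surfaces.
after = best_candidate(candidates, violations, {"PRO-DROP": 0.5, "IC": 1.0})
```

On this picture, targeted fine-tuning does not teach the model a new constraint; it re-weights constraints the model already encodes, flipping which candidate wins.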
Taken together, we conclude that there are two ways of understanding mismatches between model linguistic behavior and human linguistic behavior. Either a model fails to learn the necessary linguistic constraint, or it succeeds in learning the constraint but fails to learn the correct interaction with other constraints. Existing literature points to a number of reasons a model may be unable to learn a linguistic representation, including the inability to learn mappings between form and meaning and the lack of embodiment (e.g., Bender and Koller, 2020;Bisk et al., 2020). We suggest that researchers should re-conceptualize linguistic inference on the part of neural models as inference of constraints and constraint ranking in order to better understand model behavior. We believe such framing will open additional connections with linguistic theory and psycholinguistics. Minimally, we believe targeted fine-tuning for constraint re-ranking may provide a general method both to understand what linguistic knowledge these models possess and to aid in making their linguistic behavior more human-like.

Conclusion and Future Work
The present study provided evidence that model behavior can be meaningfully described, and understood, with reference to competing constraints. We believe that this is a potentially fruitful way of reasoning about model linguistic knowledge. Possible future directions include pairing our behavioral analyses with representational probing in order to more explicitly link model representations and model behavior (e.g., Ettinger et al., 2016; Hewitt and Liang, 2019), or exploring constraint competition in different models, like GPT-2, which has received considerable attention for its apparent linguistic behavior (e.g., Hu et al., 2020) and its ability to predict neural responses (e.g., Schrimpf et al., 2020).

A Additional Fine-tuning Training Information
The full breakdown of pronouns added or removed in the fine-tuning training data is detailed below. English can be found in Table 2, Chinese in Table 3, Spanish in Table 4, and Italian in Table 5.

B Expanded Results (including mBERT)
The full details of the pairwise t-tests conducted for the present study are given below (including the results for mBERT). The results for English models are in Table 6, for Chinese models in Table 7, for Spanish models in Table 8, and for Italian models in Table 9.

Table 6: Results from pairwise t-tests for English across the investigated models. O-O refers to object antecedent after object-biased IC verb and O-S to object antecedent after subject-biased IC verb (similarly for subject antecedents S-O and S-S). CI is 95% confidence intervals (where positive is an IC effect). BERT BASE and BERT PRO refer to models fine-tuned on baseline data and data with a pro-drop process, respectively.

Table 9: Results from pairwise t-tests for Italian across the investigated models. O-O refers to object antecedent after object-biased IC verb and O-S to object antecedent after subject-biased IC verb (similarly for subject antecedents S-O and S-S). CI is 95% confidence intervals (where positive is an IC effect). BERT BASE and BERT PRO refer to models fine-tuned on baseline data and data with a pro-drop process, respectively.