Language Models Use Monotonicity to Assess NPI Licensing

We investigate the semantic knowledge of language models (LMs), focusing on (1) whether these LMs create categories of linguistic environments based on their semantic monotonicity properties, and (2) whether these categories play a similar role in LMs as in human language understanding, using negative polarity item licensing as a case study. We introduce a series of experiments consisting of probing with diagnostic classifiers (DCs), linguistic acceptability tasks, as well as a novel DC ranking method that tightly connects the probing results to the inner workings of the LM. By applying our experimental pipeline to LMs trained on various filtered corpora, we are able to gain stronger insights into the semantic generalizations that are acquired by these models.


Introduction
Neural language models (LMs) have become powerful approximators of human language, making it increasingly important to understand the features and mechanisms underlying their behavior. In the past few years, a substantial number of studies have investigated the linguistic capabilities of LMs (Gulordava et al., 2018; Giulianelli et al., 2018; Lakretz et al., 2019; Wu et al., 2020; Ettinger, 2020, i.a.). Such work has focused primarily on syntactic properties, while fewer studies have examined which formal semantic features are encoded by language models. In this paper, we focus explicitly on what LMs learn about a semantic property of sentences, and in what ways their knowledge reflects well-known features of human language processing.
As the topic of our studies, we consider monotonicity, a semantic property of linguistic environments that plays an important role in human language understanding and inference (Hoeksema, 1986; Valencia, 1991; Van Benthem, 1995; Icard III and Moss, 2014): the monotonicity of a linguistic environment determines whether inferences from a general to a particular term or vice versa are valid in that environment. For example, the fact that the inference from "Mary didn't write a paper" to "Mary didn't write a linguistics paper" is valid shows us that the position where "a paper" occurs is downward monotone: the inference is valid when a more general term ("a paper") is replaced with a more specific one ("a linguistics paper").
All code and data can be found at https://github.com/jumelet/monotonicity-npi-lm
To investigate monotonicity we focus on negative polarity items (NPIs): a class of expressions such as any or ever that are solely acceptable in downward monotone environments (Fauconnier, 1975;Ladusaw, 1979). Psycholinguistic research has confirmed this connection between NPIs and monotonicity: humans judge NPIs acceptable in a linguistic environment if they consider that environment to be downward monotone (Chemla et al., 2011). Previous research has established that LMs are relatively successful in processing NPIs (Warstadt et al., 2019), but without investigating how they came to these successes.
We raise the following research questions:
RQ1 Do language models encode the monotonicity properties of linguistic environments?
RQ2 To what extent do they employ this information when processing negative polarity items?
We developed a series of experiments, in which we first evaluate the general capacities of LMs in handling monotonicity and NPIs and then investigate the generalization heuristics of the LM by running experiments with modified training corpora. First, we establish that LMs are able to encode a notion of monotonicity by probing them with diagnostic classifiers (DCs; §5.1).
In our second experiment we demonstrate that our LMs are reasonably successful with NPI licensing using an NPI acceptability task (§5.2). Next, we introduce a novel DC ranking method to investigate the overlap between the information that the model uses to make judgments about NPIs and the information that the DCs use to predict monotonicity information, finding that there is a significant overlap (§5.3).
We then investigate two potential confounds that may obfuscate our results. First, we consider whether the signal that is picked up by the monotonicity DC is merely a proxy that tells the model that an NPI may occur at that position (§5.4). To assess this, we train new LMs on a corpus from which all sentences with NPIs have been removed, re-run the monotonicity probing task, and find that even in the absence of NPI information, LMs are still able to encode a notion of monotonicity.
Next, we consider whether an LM bases its NPI predictions on simple co-occurrence heuristics, or if it can extrapolate from a general notion of monotonicity to cases of NPIs in environments in which they have never been encountered during training (§5.5). We again train new LMs on modified corpora, this time removing NPIs only in one specific environment, and repeat the NPI acceptability and DC ranking experiments. The results of this setup demonstrate that LMs indeed use a general notion of monotonicity to predict NPI licensing.
Contributions With this work, we contribute to the ongoing study of the linguistic abilities of language models in several ways:
• With a series of experiments we demonstrate that LMs are able to acquire a general notion of monotonicity that is employed for NPI licensing.
• We present two novel experimental setups, filtered corpus training and DC ranking, which can be used to assess the impact of specific information during training and to compare the information used by DCs with the information used by the model, respectively.
• By using experimental results from psychosemantics to motivate hypotheses for LM behavior, we find that our models reflect behavior similar to human language processing.
In the remainder of this paper, we will first provide some linguistic background that helps to situate and motivate our experiments and results ( §2).
We then discuss related work on NPI processing in LMs in §3. In §4, we discuss our methods and experimental setup. §5.1 through §5.5 explain and present the results. We conclude in §6 with a general discussion and pointers to future work.

Linguistic Background
Monotonicity Monotonicity is a property of a linguistic environment which determines what kind of inferences relating general and particular terms are valid in that environment. If inferences from a general to a particular term are valid, the linguistic environment is said to be downward monotone (DM). If inferences are valid the other way around, from a particular to a general term, the linguistic environment is said to be upward monotone (UM).
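The definition can be made concrete with a small set-based sketch. It is only an illustration under simplifying assumptions: terms are modeled as sets of items, "Mary ate X" is taken to be true when she ate at least one member of X, and the domain, item names, and helper functions are all hypothetical. An inference counts as valid when the conclusion holds in every situation in which the premise holds.

```python
from itertools import chain, combinations

# Toy domain: "chocolate cookies" is a more particular term than "cookies".
DOMAIN = ["choc_cookie", "plain_cookie", "apple"]
COOKIES = {"choc_cookie", "plain_cookie"}
CHOC_COOKIES = {"choc_cookie"}  # a subset of COOKIES


def all_situations(domain):
    """Enumerate every possible set of things Mary could have eaten."""
    items = list(domain)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [set(s) for s in subsets]


def ate(eaten, term):
    """'Mary ate <term>': she ate at least one item in the term's extension."""
    return bool(eaten & term)


def did_not_eat(eaten, term):
    """'Mary didn't eat <term>': the negation of `ate`."""
    return not ate(eaten, term)


def inference_valid(environment, premise_term, conclusion_term):
    """Valid iff the conclusion holds in every situation where the premise holds."""
    return all(
        environment(eaten, conclusion_term)
        for eaten in all_situations(DOMAIN)
        if environment(eaten, premise_term)
    )


# Negation is downward monotone: only general -> particular is valid.
dm_general_to_particular = inference_valid(did_not_eat, COOKIES, CHOC_COOKIES)
dm_particular_to_general = inference_valid(did_not_eat, CHOC_COOKIES, COOKIES)
# The positive sentence is upward monotone: only particular -> general is valid.
um_particular_to_general = inference_valid(ate, CHOC_COOKIES, COOKIES)
um_general_to_particular = inference_valid(ate, COOKIES, CHOC_COOKIES)
```

In this toy model the negated environment licenses only the general-to-particular inference, and the positive environment only the particular-to-general one, matching the DM and UM definitions.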
Examples of expressions inducing DM environments are negation and quantifiers like nobody, no NP, but also specific types of adverbs and the antecedents of conditional sentences. For instance, (1) below exemplifies that in these environments the inference from a sentence with a general term (cookies) to that sentence with a more particular term (chocolate cookies) is valid, but not vice versa.
(1) a. Mary didn't eat cookies.
Common examples of UM environments are (nonquantified) positive sentences, quantifiers such as somebody, many NP, and other kinds of adverbs.
(2) exemplifies that in these environments the inference from a sentence with a more particular term (chocolate cookies) to the same sentence with a general term (cookies) is valid, but not vice versa.
(2) a.
NPIs NPIs are expressions such as the English words any, anyone, ever, whose acceptability depends on whether their linguistic environment is downward monotone (Fauconnier, 1975; Ladusaw, 1979; Dowty, 1994; Kadmon and Landman, 1993; Krifka, 1995; Lahiri, 1998; Chierchia, 2006, 2013). If we again consider the DM environment of (1-a) and the UM environment of (2-a), it can be seen that English any is an NPI, as it is acceptable when inside the syntactic scope of negation (a DM expression) as in (3-a), and not acceptable when it is in a UM environment as in (3-b).
Importantly, monotonicity plays a role at the psychological level: human judgments about the monotonicity of a linguistic environment predict their judgments of NPI acceptability in that environment (Chemla et al., 2011;Denić et al., 2021). For example, how plausible someone finds the inference (1-a) predicts how acceptable they find the sentence (3-a). Summing up, NPI licensing has a syntactic component (NPIs must reside in syntactic scope of a licensor) and a semantic component (NPI licensors are DM expressions), that are connected on a psychological level (monotonicity judgments predict NPI acceptability). Our research aims to uncover whether this connection is exhibited by LMs as well.

Related work
The literature on interpreting LMs has grown substantially in the last few years (see, e.g., Belinkov and Glass, 2019; Alishahi et al., 2019; Rogers et al., 2021, for survey papers). Several studies investigate how LMs process NPIs, focusing mainly on the syntactic aspect of NPI licensing. Jumelet and Hupkes (2018) conclude that LSTM language models encode information about the dependency between the NPI and the NPI licensor, although this effect diminishes as the distance between the NPI and its licensor grows. Marvin and Linzen (2018) study NPI judgments of LMs on minimally different sentence pairs (with the NPI licensor either in an appropriate syntactic configuration or not) and find that their models are unable to reliably assign higher probability to sentences in which NPIs are correctly licensed. The syntactic aspect of NPI licensing has also been examined in work showing that LSTM LMs are susceptible to learning spurious licensing relationships, a finding that has been shown to hold for BERT (Devlin et al., 2019) as well. Other work investigates how explicit syntactic supervision of LMs affects their success with syntactic aspects of NPI licensing. Broad linguistic evaluation suites, including that of Hu et al. (2020), also contain a set of tasks related to NPI licensing, demonstrating that it is one of the most challenging tasks for LMs to handle. Weber et al. (2021) investigated the dynamics of NPI learning during training, and connected this to a multi-task learning paradigm, demonstrating that LMs are able to efficiently leverage information from related licensing environments.
Lastly, Warstadt et al. (2019) examine BERT's ability in determining NPI acceptability. They demonstrate that BERT has significant knowledge of the dependency between NPIs and their licensors, but that this success varies widely across different experimental methods. Our study builds on that of Warstadt et al. (2019). Although they demonstrate that BERT is generally successful with NPI licensing, their results do not reveal whether BERT has constructed a more general category of DM expressions that is independent of collocational cues, nor whether it has understood that this category matters for NPI licensing.

Methods
Before getting to the main experimental part of our work, we briefly discuss the training corpus, model architecture and evaluation corpus we consider.

Training Corpus
The base training corpus we consider in our experiments is the corpus used by Gulordava et al. (2018). This corpus is a collection of sentences from Good and Featured English Wikipedia articles and consists of over 90M tokens. The vocabulary consists of the 50,000 most frequent tokens in this corpus; less frequent tokens are mapped to a special <unk> token. We refer to the full training corpus as Full, and to the LMs trained on this corpus as Full LMs. In addition to Full, we use multiple other corpora which are derived from Full by means of filtering. This will allow us to draw conclusions about specific generalization abilities and reliance on collocational cues of LMs; filtered corpora will be introduced in the relevant sections.
Model Architecture In our studies, we focus on recurrent language models. More specifically, following Gulordava et al. (2018), we consider two-layer LSTM language models, with an embedding and hidden size of 650. All training runs across our experiments follow the same regime, identical to the regime described by Gulordava et al. (2018): 40 epochs of training with SGD, with a plateau scheduler and an initial learning rate of 20, a batch size of 64, BPTT length of 35, and dropout of 0.1.

Evaluation Corpus To assess monotonicity and NPI licensing knowledge of LMs in our experiments, we leverage the NPI corpus of Warstadt et al. (2019), which consists of a large number of grammatical and ungrammatical sentences with NPIs. This corpus is divided into nine distinct environment classes, allowing for fine-grained analysis of NPI licensing. Importantly, these nine environment classes come in two versions: a DM version, in which NPIs are grammatically acceptable, and a minimally different UM version, in which they are not. We provide an overview with examples of DM and UM versions of all environment classes in Table 1. The corpus contains 106,000 distinct DM sentences in total, divided roughly uniformly over the environment classes.

Experiments and Results
In this section we describe the experimental pipeline in more detail. A graphical overview of our experiments is depicted in Figures 1 and 4. Each experiment description is directly followed by an analysis of its results. For training and testing the DCs, we consider the hidden states at the position directly before an NPI occurs (see Figure 1). We train the DCs at this position because only at this point can we be sure that the monotonicity information should surface and be encoded linearly. This is because the decoder of the LM, which transforms a hidden state into a probability distribution, is itself linear: if the probability of some token depends on a linguistic feature, this feature must be encoded linearly. The DCs are implemented using the diagNNose library of Jumelet (2020), and trained using 10-fold cross-validation, Adam optimization (Kingma and Ba, 2015), a learning rate of $10^{-2}$, and L1 regularization with $\lambda = 0.005$.
We train our monotonicity DCs in two separate ways. First, we divide the entire monotonicity corpus into a 90/10 train/test split, sampled uniformly across the different environment classes. This allows us to examine whether DM and UM environments are linearly separable in a way that is applicable to all environment classes. We refer to this classifier as the All-ENV DC.
Second, we move to a more fine-grained type of analysis. High performance of the All-ENV DC does not by itself provide evidence that monotonicity is encoded the same way for each environment: the set of salient hidden units used by the All-ENV DC for classifying monotonicity within the Adverbs environment, for example, could be disjoint from the set of units used for the Only environment. To investigate this, we train a DC on the hidden states of all-but-one environment class, and test its performance on the excluded class. This provides a measure of the extent to which the monotonicity representation of DM and UM environments derived from all other environment classes generalizes to the held-out class, providing stronger evidence that the model represents monotonicity in the same way across different environments.
Figure 1: The pipeline of our experimental setup. We start by computing the hidden states $h^{\downarrow}_t$ (within a DM environment ahead of the NPI) and $h^{\uparrow}_t$ (within a UM environment). These hidden states are then used for training the monotonicity DC (Exp. 1 & 4), and to compare $P_{LM}(\text{NPI} \mid h^{\downarrow}_t)$ with $P_{LM}(\text{NPI} \mid h^{\uparrow}_t)$. The task of Experiments 3 and 5a can be found in Figure 4. Experiments 4 and 5 consist of the same tasks as the first three experiments, but differ in the language model that is used.
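The all-but-one training procedure can be sketched as a data-splitting step. This is a minimal illustration rather than the diagNNose implementation: the triple layout (environment class, hidden state, label), the function name, and the toy vectors are assumptions made for the example, and the classifier itself is left out.

```python
from collections import defaultdict


def held_out_splits(examples):
    """Yield (held_out_class, train, test) for every environment class.

    `examples` is a list of (env_class, hidden_state, label) triples, where
    label is 1 for a DM environment and 0 for a UM environment. This triple
    layout is an assumption made for the sketch.
    """
    by_class = defaultdict(list)
    for env_class, hidden_state, label in examples:
        by_class[env_class].append((hidden_state, label))

    for held_out in sorted(by_class):
        # Train on every class except the held-out one.
        train = [
            item
            for env_class, items in by_class.items()
            if env_class != held_out
            for item in items
        ]
        # Test only on the excluded class.
        test = list(by_class[held_out])
        yield held_out, train, test


# Toy hidden states (2-dimensional, made up) for three environment classes.
toy = [
    ("adverbs", [0.1, 0.9], 1),
    ("adverbs", [0.8, 0.2], 0),
    ("only", [0.2, 0.7], 1),
    ("only", [0.9, 0.1], 0),
    ("simple_questions", [0.3, 0.6], 1),
    ("simple_questions", [0.7, 0.4], 0),
]
splits = list(held_out_splits(toy))
```

Each split would then be passed to a linear probe: a DC trained on `train` and scored on `test` gives the held-out accuracy for that environment class.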

Results
The results of our first experiment are shown in the top row of Figure 2. The first column contains the average accuracy for the All-ENV DC, and it can be seen that the diagnostic classifier succeeds in this task with high accuracy (97%). This indicates that DM and UM environments are linearly separable when sampled uniformly over all environment classes.
Next, we consider the held-out evaluation procedure for each of the nine environment classes. It can be seen that the monotonicity signal generalizes well to five classes (adverbs, determiner negation, only, sentential negation, and embedded questions), all with an accuracy above 90%. The other four classes yield a higher standard deviation, indicating that these classes are encoded less consistently across initialization seeds. The accuracy for all held-out DCs is lower compared to the All-ENV DC results, indicating that the All-ENV DC relied partly on information unrelated to a shared notion of monotonicity. The fact that the accuracy of these DCs is still so high, however, indicates that there is a substantial overlap between the way that monotonicity is encoded within the different environments.

Experiment 2: Do LMs predict the licensing conditions of NPIs?
In the next experiment we investigate the NPI acceptability judgments of the Full LMs on the corpus of Warstadt et al. (2019). This is done by comparing the probability of an NPI conditioned on the model's representation of a DM environment ($h^{\downarrow}_t$) and a UM environment ($h^{\uparrow}_t$), where success is defined as follows:
$$P_{LM}(\text{NPI} \mid h^{\downarrow}_t) > P_{LM}(\text{NPI} \mid h^{\uparrow}_t)$$
This is a common evaluation procedure in the interpretability literature (Linzen et al., 2016), and has earlier been applied in the domain of NPI licensing by Jumelet and Hupkes (2018), among others. Our approach is similar to the Cloze Test of Warstadt et al. (2019), but their setup used (bi-directional) masked LMs, making it possible to directly compare the probabilities of the NPI licensor, instead of comparing the NPI probabilities. Note that we purposefully do not base NPI acceptability on comparing full sentence probabilities: in our view this type of comparison can be distracted by token probabilities not related to the NPI itself.
Figure 2: Accuracy and standard deviation on the monotonicity diagnostic classification task, averaged over 5 seeds for each model type. The All-ENV column denotes the train/test split procedure sampled uniformly over all environment classes; other columns denote accuracy on one environment class that has been excluded during training.
Figure 3: Accuracy on the NPI acceptability task, based on whether the NPI was assigned a higher probability in the DM environment than in its UM counterpart.

We split this procedure out for each of the nine environment classes. The example sentence of the Simple Questions environment in Table 1, for example, is evaluated as follows:
$$P_{LM}(\text{ever} \mid \text{Did the boy}) > P_{LM}(\text{ever} \mid \text{The boy did})$$
Using the full sentence probabilities for this comparison would require taking into account probabilities such as $P_{LM}(\text{the} \mid \text{Did})$ and $P_{LM}(\text{boy} \mid \text{The})$, which have no relation to NPI licensing at all.
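The conditional comparison can be sketched in a few lines. The sketch assumes a hypothetical callable `cond_prob(token, prefix)` that returns the LM's probability of `token` given `prefix`; here it is replaced by a lookup table with made-up probabilities, since no trained model is loaded. The function and variable names are illustrative only.

```python
def npi_acceptability_accuracy(pairs, cond_prob):
    """Fraction of minimal pairs where the NPI is more probable after the DM prefix.

    `pairs` holds (npi, dm_prefix, um_prefix) triples; `cond_prob(token, prefix)`
    is a hypothetical callable returning P_LM(token | prefix).
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(
        cond_prob(npi, dm_prefix) > cond_prob(npi, um_prefix)
        for npi, dm_prefix, um_prefix in pairs
    )
    return correct / len(pairs)


# Stand-in for a trained LM: a lookup table with made-up probabilities.
TOY_PROBS = {
    ("ever", "Did the boy"): 0.020,
    ("ever", "The boy did"): 0.001,
    ("any", "Mary didn't eat"): 0.050,
    ("any", "Mary did eat"): 0.060,
}


def toy_cond_prob(token, prefix):
    """Look up the made-up conditional probability for (token, prefix)."""
    return TOY_PROBS[(token, prefix)]


pairs = [
    ("ever", "Did the boy", "The boy did"),
    ("any", "Mary didn't eat", "Mary did eat"),
]
accuracy = npi_acceptability_accuracy(pairs, toy_cond_prob)
```

Only the NPI's own probability enters the comparison, so differences in the probabilities of the surrounding tokens cannot affect the outcome.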

Results
We present the results for this experiment in the top row of Figure 3. The Full models demonstrate a considerable ability to predict NPI acceptability: even the worst-performing class (SMP-Q, Simple Questions) yields an accuracy of 0.72, still well above chance. Compared to earlier investigations on the ability of LSTM LMs in NPI licensing, our results indicate that these models are able to obtain a more sophisticated understanding of NPIs than previously thought: both Marvin and Linzen (2018) and Hu et al. (2020) report LSTM performance below chance on NPI acceptability tasks. This might in part be due to the different evaluation procedure we used (conditional vs. full-sentence probability comparison).

Experiment 3: Is the LM's knowledge of DM environments and of NPI licensing related?
We have now established that our models encode a signal related to monotonicity, and are successful at predicting NPI acceptability. In our third experiment, we assess to what extent the parameters used by the LM to predict NPIs (i.e. the LM's decoder embeddings for NPIs) overlap with the information the DCs use to predict the monotonicity properties of a particular environment class. For this we have devised a novel DC ranking method, that ranks the LM's decoder weights for all tokens based on their similarity with the DC weights.
We present a schematic overview of the method in Figure 4. The LM's decoder weight matrix can be interpreted as a collection of vectors corresponding to each token in the model's vocabulary. The monotonicity DC is a binary classifier, so its weights are represented by a single vector. The LM's decoder vectors are of the same dimensionality as the weight vector of the monotonicity DC, which allows us to compute the similarity between each decoder vector and the monotonicity DC. For each of the 50,000 tokens in the LM's vocabulary, we calculate the cosine similarity between the decoder weights corresponding to that token and the DC's weights. We then sort these similarity scores, which results in a ranking of tokens that are most similar to the DC.
Figure 5: Results on the median NPI rank task. A low median rank indicates that the monotonicity DC uses the same representational information as the NPI decoder.
As we are interested in finding the connection between the monotonicity DC and the LM's NPI processing in general, we compute the median rank over a set of 11 NPIs. A low median NPI rank indicates that the LM uses the same cues for NPI prediction as the monotonicity DC, demonstrating a clear connection between NPI licensing and monotonicity.
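The ranking procedure can be sketched as follows. This is a minimal pure-Python illustration with a five-token toy vocabulary and made-up 2-dimensional vectors. Two details are assumptions of the sketch rather than facts from the experiments: ranks are zero-based, and the positive direction of the DC weight vector is taken to correspond to the DM class.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def median_npi_rank(decoder, dc_weights, npis):
    """Rank all tokens by cosine similarity to the DC; return the median NPI rank.

    `decoder` maps each token to its decoder weight vector. Rank 0 is the token
    most similar to the DC (zero-based ranks are an assumption of this sketch).
    """
    ranked = sorted(
        decoder,
        key=lambda tok: cosine(decoder[tok], dc_weights),
        reverse=True,
    )
    rank_of = {tok: rank for rank, tok in enumerate(ranked)}
    return median(rank_of[npi] for npi in npis)


# Made-up decoder vectors for a five-token vocabulary.
toy_decoder = {
    "any": [0.9, 0.1],
    "ever": [0.8, 0.3],
    "the": [-0.2, 0.9],
    "some": [-0.9, 0.1],
    "cookies": [0.1, -0.8],
}
toy_dc = [1.0, 0.0]  # made-up DC weight vector; positive direction = DM
result = median_npi_rank(toy_decoder, toy_dc, ["any", "ever"])
```

In the toy vocabulary the two NPIs occupy the top two positions, so the median rank is as low as it can be for a set of two tokens.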
In contrast to Experiment 1, we no longer make use of the hold-one-out training procedure, which gave insight into the extent to which a general monotonicity signal generalizes to a held-out environment class. Instead, we train a separate diagnostic classifier for each environment class using a train/test split made up of DM and UM environments within that class. This results in a classifier that represents the class-specific decision boundary between minimal pairs of DM and UM items and allows us to investigate to what extent these decision boundaries align with the weights of the LM decoder. Next to the environment-specific DCs we also report the DC ranking outcome for the All-ENV DC that has been trained on all environments.

Results
The results of this experiment are presented in the top row of Figure 5. The first column (All-ENV) contains the result for the DC trained on all environment classes, and the median NPI rank of 9 demonstrates that the monotonicity DC aligns very closely with the NPI decoder weights of the LMs. This median rank should be interpreted within the context of the model vocabulary size: it can range up to 50,000, so a rank that is close to 0 signifies a tight connection between the probing task and the tokens of interest.
Moving on to the environment-specific results, it can be seen that the results vary considerably between the environment classes. The worst scoring class is again that of Simple Questions. This makes sense, as the licensing conditions for question constructions do not depend on the presence of a specific licensing token such as not, but on the overall structure of the whole sentence. The other environment classes lead to scores far closer to 0, indicating that for these classes monotonicity classification is closely aligned to NPI processing.
Interestingly, the median rank of the All-ENV DC is lower than the ranks of all other DCs. This shows that the model has aligned its representation of NPIs to an aggregate of the monotonicity representations in the different environment classes. This allows the model to flexibly deal with NPIs in a wide range of licensing environments.

Experiment 4: Are NPIs important for learning monotonicity information?
With Experiment 3 we established that NPI processing and monotonicity are related in our LMs. Now, we investigate to what extent their representations are entangled during training. More specifically, we investigate if the signal from the presence of NPIs is indispensable for the LM to develop a notion of monotonicity, or if instead the success in categorizing monotonicity environments can be learned independently of NPIs. We address this question by testing whether LMs can still classify the monotonicity properties of environments when they are completely deprived of NPIs during training. To do so, we train new language models on a modified corpus that does not contain any NPIs at all. To arrive at this corpus, we remove all sentences that contain at least one NPI expression from the Full corpus. We identify these expressions based on a comprehensive list of NPI expressions in English collected by Hoeksema (2012) and the list of NPIs in English compiled by Israel (2011). From this list, we manually removed expressions that have both NPI and non-NPI uses (e.g. a thing, a bit). The 40 NPI expressions that resulted from this procedure can be found in Appendix A. We train 5 models on this corpus and refer to them by the name Full\NPI.
In this experiment, we run the monotonicity probing procedure of Experiment 1 on the Full\NPI models. We posit that if the notion of monotonicity can be learned independently of NPIs, there should be no significant drop in performance compared to the results of the Full LMs.

Results
We report the results of this experiment in the bottom row of Figure 2. Again it can be noted that the All-ENV DC, trained and tested uniformly over all environment classes, obtains a high accuracy on the task (0.95). Furthermore, none of the held-out environment DCs lead to significant drops in performance compared to the Full LMs. Based on this we conclude that even in the absence of NPI cues, LMs are still able to build up a shared robust notion of monotonicity.

Experiment 5: How robust is the connection between monotonicity and NPI processing?
This research aims to uncover whether LMs possess a robust connection between monotonicity and NPI licensing. Our findings indicate that this connection is present in our models. A major confound that has not yet been addressed, however, is the extent to which our models rely on collocational cues when judging the acceptability of an NPI. To test this, we examine whether an LM's connection between NPIs and monotonicity generalizes to novel environment classes in which NPIs have never been encountered during the training phase of the LM.
We have created nine modified corpora in which sentences with NPIs within a specific environment have been removed. For these different corpora, we again consider the nine NPI-licensing environments of Warstadt et al. (2019). For each environment class we create a new corpus by removing all sentences from the Full corpus in which an NPI expression from Appendix A is preceded by an expression belonging to that class, somewhere earlier in the sentence. Note that we only remove the sentences in which the environment actually licenses an NPI; sentences in which the environment occurs without an NPI are retained. So for the adverbs environment, for example, we remove sentences like "Mary rarely ate any cookies" but not "Mary rarely ate cookies". For each of these nine corpora we train 3 new LMs. Models trained on these corpora are referred to by the name Full\ENV∩NPI.
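The filtering rule can be sketched as a simple token-level check. This is an illustration under stated assumptions, not the filtering script used for the experiments: sentences are assumed to be whitespace-tokenized, matching is case-insensitive, and the licensor and NPI lists below are short hypothetical stand-ins for an environment class and for the list in Appendix A.

```python
def find_expression(tokens, expression, start=0):
    """Return the index where `expression` (a token list) first occurs at or
    after position `start`, or -1 if it does not occur."""
    n = len(expression)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == expression:
            return i
    return -1


def licensor_precedes_npi(sentence, licensors, npis):
    """True if some licensor expression occurs before some NPI expression."""
    tokens = sentence.lower().split()
    for licensor in licensors:
        lic = licensor.lower().split()
        pos = find_expression(tokens, lic)
        if pos == -1:
            continue
        # Only look for NPIs to the right of the (first) licensor occurrence.
        after = pos + len(lic)
        if any(
            find_expression(tokens, npi.lower().split(), after) != -1
            for npi in npis
        ):
            return True
    return False


def filter_corpus(sentences, licensors, npis):
    """Keep every sentence in which no licensor of this class precedes an NPI."""
    return [
        s for s in sentences if not licensor_precedes_npi(s, licensors, npis)
    ]


corpus = [
    "Mary rarely ate any cookies",
    "Mary rarely ate cookies",
    "Mary did not eat any cookies",
    "Mary often ate cookies",
]
# Hypothetical stand-ins for the adverbs class and the NPI list.
kept = filter_corpus(corpus, licensors=["rarely", "seldom"], npis=["any", "ever"])
```

Sentences with the licensor but no NPI, and sentences where the NPI is licensed by a different environment, remain in the corpus, so the model still sees both the environment and the NPI, just never together.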
We run the NPI acceptability task of Experiment 2 and the DC ranking method of Experiment 3 on the nine types of Full\ENV∩NPI models. A model with a robust connection between monotonicity and NPI processing should be able to learn for NPIs in the held-out environment that (i) the environment belongs to the class of environments in which NPIs are licensed, and that (ii) determining NPI acceptability should be done based on representational cues that are similar to those used for monotonicity prediction.

Results
We report the results of this experiment next to the previous results of the Full model. First, we consider the NPI acceptability task, which is reported in the bottom row of Figure 3. Note that each cell in this row now corresponds to a specific model type: the ADV result, for instance, corresponds to the accuracy of the Full\ENV∩NPI models in which sentences with NPIs within adverbial environments have been removed. Our results show that the performance drops slightly for all environment classes, which can be attributed to a model's dependence on collocational cues. However, the models are still able to adequately generalize from the other environments, in which NPIs still are encountered, to the held-out environment. This demonstrates the semantic generalization capacities of the LM: it infers that the held-out environment in which NPIs have never been encountered shares some relevant properties with the other eight environment classes in which NPIs still occur.
The results for the DC ranking experiment are shown in the bottom row of Figure 5. Similar to the NPI acceptability results, the performance of the Full\ENV∩NPI models has dropped slightly compared to the Full models. However, if the models no longer picked up on the connection between monotonicity in the held-out environment and NPI licensing at all, these median ranks would fall to chance level, i.e. around the halfway mark of the vocabulary size (25,000). It can be seen that this is only the case for the Simple Questions environment, which was already performing poorly for the Full models. Based on this we conclude that although models depend partly on collocational cues for their connection between monotonicity and NPIs, they are still able to encode a robust connection that generalizes to novel DM environments.

Conclusion
Based on a series of experiments, we have established the following: (1) LMs categorize environments into DM and UM; (2) LMs are overall successful with NPI licensing; (3) LMs employ similar representational cues when processing NPIs and predicting monotonicity; (4) their categories of DM and UM environments can be learned independently of NPI occurrence; and (5) their connection between monotonicity and NPI processing is robust and not solely dependent on co-occurrence heuristics. This demonstrates that LMs have quite sophisticated knowledge of NPI licensing, which may be similar to that of humans and constitutes a vital step towards better understanding the linguistic generalization capacities of LMs.
These results raise the question: what do LMs learn about the DM and UM environments which they succeed in finding? Do they actually learn the inferential properties of those environments, or do they rely on some other property that DM environments have in common to categorize them as such? A direction for future work would be to develop methods to probe the inferential capacities of LMs and explore how they align with the DM and UM categories they construct.
Another direction for future work would be to incorporate the recent advancements in probing-based interpretability methods into our experimental pipeline (Hewitt and Liang, 2019; Voita and Titov, 2020). Our DC ranking method aligns the performance of a probe with that of the language model itself, which is related to the approaches of Saphra and Lopez (2019), Elazar et al. (2021), and Lovering et al. (2021). Placing our methodology more firmly in this body of work will allow for stronger conclusions to be drawn regarding the semantic knowledge of current language models.