AND does not mean OR: Using Formal Languages to Study Language Models’ Representations

A current open question in natural language processing is to what extent language models, which are trained with access only to the form of language, are able to capture the meaning of language. This question is challenging to answer in general, as there is no clear line between meaning and form, but rather meaning constrains form in consistent ways. The goal of this study is to offer insights into a narrower but critical subquestion: Under what conditions should we expect that meaning and form covary sufficiently, such that a language model with access only to form might nonetheless succeed in emulating meaning? Focusing on several formal languages (propositional logic and a set of programming languages), we generate training corpora using a variety of motivated constraints, and measure a distributional language model’s ability to differentiate logical symbols (AND, OR, and NOT). Our findings are largely negative: none of our simulated training corpora result in models which definitively differentiate meaningfully different symbols (e.g., AND vs. OR), suggesting a limitation to the types of semantic signals that current models are able to exploit.


Introduction
A current open question in natural language processing is to what extent language models (LMs; neural networks trained to predict the likelihood of word forms given textual context) are capable of truly understanding language. Bender and Koller (2020) argue that, since such models are trained exclusively on the form of language, they cannot possibly learn the meaning of language. We argue that the question of whether language models can learn meaning cannot be settled a priori. While language models only have direct access to form, linguistic form often correlates with meaning. The strength of the correlation varies across both different aspects of language and different tests of linguistic competence. While several intuitive tests of understanding (e.g., demonstrating knowledge of the word dog by identifying pictures of dogs) are out of scope for LMs, many tasks which NLP aspires to solve (e.g., question answering, machine translation) operate entirely on natural language input and output. Thus, a relevant question is whether models which operate only on the forms of language can nonetheless learn to differentiate meanings.
Our goal is to focus on a tractable subproblem in order to improve our intuitions about the types of distributional signals that LMs can use to extract information relevant to meaning. We simulate a language modeling setup using propositional logic, in which we can naturally operationalize form to be strings of symbols in the language and meaning to be truth conditions. We define the semantic transparency of a text-only training corpus to be the degree to which an LM trained on that corpus learns to differentiate between aspects of form that affect truth conditions and aspects of form that do not. We have two primary research questions. First, what constraints on corpus generation produce greater semantic transparency? And second, are any such constraints sufficient for an LM to adequately differentiate meanings?

Experimental Design

Dataset Generation
We consider the form of a sentence to be simply the observed, syntactically-valid strings of characters and the meaning to be the truth conditions. Propositional logic is a simple language in which we can characterize both form and meaning. We use the grammar in Table 1, with standard semantics.
We focus our analysis on whether the representations of logical operators (∧, ∨, ¬) are influenced by distributional patterns that go beyond their superficial syntactic similarity evident in the grammar. That is, if a trained LM identifies that the meanings of ∧_1, ..., ∧_k are identical to one another, and different from the meanings of ∨_1, ..., ∨_l, we expect the embeddings for the ∧_i to be more similar to one another than they are to any of the ∨_i or the ¬_i. We consider a corpus to be semantically transparent if an LM trained on the corpus learns semantically clustered representations of the logical operators.
We generate four different training corpora, motivated by different assumptions one might make about how natural language corpora arise. These constraints are as follows, ordered roughly from weakest to strongest:
1. Syntactic Constraint. Speakers only generate sentences which are syntactically well-formed (that can be parsed by a syntactic parser). Here, this amounts to sampling from the grammar without additional constraints.
2. Truthfulness Constraint. Speakers of the language are constrained to generate sentences that are true in some context, i.e., that evaluate to True in at least one possible world. To implement this, we again sample from the grammar but additionally check with a satisfiability checker and omit sentences which are not satisfiable. For example, (sym_1 ∧ (¬(sym_1))) would not appear.
3. Informativity Constraint. Speakers generate sentences not just to state true facts, but to provide listeners with information about a particular state of affairs. To simulate such a constraint, we randomly sample a set of "target worlds" T and a set of "alternative worlds" A such that T ∩ A = ∅. We then generate the shortest sentence s such that s is true in every world in T and s is false in every world in A. We experiment with several sizes of T and A, but report only on |T| = |A| = 2 as this provides the right balance of contextual diversity. See the Appendix for additional discussion.
4. Explicit Grounding. We consider a setting in which speakers explicitly dictate the full state of affairs, without ambiguity. This is not intended as a realistic model of how corpora are generated, but rather to provide an upper bound on semantic transparency by giving models a corpus in which form is perfectly correlated with meaning. We generate this corpus in the same way as the Truthfulness corpus, but append an explicit marker of the truth values of the variables in the sentence, e.g.: (sym_1 ∧ (¬(sym_2))) <sep> sym_1 T sym_2 F.
Sampling Parameters. Each dataset consists of 100K training and 1K validation sentences. We set the number of non-reserved symbols (N in the above grammar) to 5,000, and the number of "synonyms" of each logical symbol (K, L, M) to be 5. Thus, a sentence in one of our datasets might look like (sym_1 ∧_3 (¬_4 (sym_85))), and would be true if and only if sym_1 is true and sym_85 is false.
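The Truthfulness filter and the Explicit Grounding marker described above can be illustrated with a brute-force check over all possible worlds. This is a minimal sketch, not the authors' implementation: the tuple-based sentence representation, the function names, and the choice to ground a sentence in the first satisfying world found are all assumptions made for illustration. Operator synonym indices (e.g., the 3 in ∧_3) are ignored when evaluating truth conditions.

```python
from itertools import product

# Hypothetical representation: a sentence is either a variable name (str)
# or a tuple ("and" | "or" | "not", child, ...). A synonym index such as
# "and_3" is allowed and stripped before evaluation.

def variables_of(node):
    """Collect the variable names that occur in a sentence."""
    if isinstance(node, str):
        return {node}
    return set().union(*(variables_of(child) for child in node[1:]))

def evaluate(node, world):
    """Evaluate a sentence in a world (dict: variable name -> bool)."""
    if isinstance(node, str):
        return world[node]
    op = node[0].split("_")[0]  # drop the synonym index, if any
    if op == "not":
        return not evaluate(node[1], world)
    if op == "and":
        return evaluate(node[1], world) and evaluate(node[2], world)
    if op == "or":
        return evaluate(node[1], world) or evaluate(node[2], world)
    raise ValueError(f"unknown operator: {node[0]}")

def satisfying_worlds(node):
    """Yield every assignment of the sentence's variables that makes it true."""
    names = sorted(variables_of(node))
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if evaluate(node, world):
            yield world

def is_satisfiable(node):
    """Truthfulness constraint: true in at least one possible world."""
    return next(satisfying_worlds(node), None) is not None

def ground(sentence_text, node):
    """Append an explicit truth-value marker (assumed: first satisfying world)."""
    world = next(satisfying_worlds(node), None)
    if world is None:
        return None  # unsatisfiable sentences are omitted from the corpus
    marker = " ".join(f"{name} {'T' if value else 'F'}" for name, value in world.items())
    return f"{sentence_text} <sep> {marker}"
```

Brute-force enumeration is exponential in the number of variables, which is acceptable here only because sentences contain a handful of unique variables; a dedicated satisfiability checker would be the natural replacement for longer sentences.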
We generate sentences using a probabilistic context-free grammar with the rules shown above. The tree depth d of a generated sentence is controlled by a parameter γ such that P(d | d−1) = γ^d. The number of unique variables in a sentence is sampled from a non-zero Poisson distribution parameterized by λ. We set λ = 2 and γ = .85 in the reported experiments, but do not find that parameter choice affects our conclusions. Note that the Informativity dataset is generated deterministically, and thus sampling parameters do not apply and sentences in that dataset are shorter. Dataset statistics and data generation parameter sensitivity are in the Appendix.
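The sampling procedure can be sketched as follows. This is one possible reading of the description rather than the authors' generator: it assumes a grammar of the form S → (S ∧_i S) | (S ∨_j S) | (¬_m S) | sym_n, assumes that a node at depth d is expanded into an operator with probability γ^d (with the root at depth 1), and draws variables from a pool whose size comes from a zero-truncated Poisson, so the number of unique variables in a sentence is at most that size. All function names are hypothetical.

```python
import math
import random

SYMBOLS = {"and": "∧", "or": "∨", "not": "¬"}

def sample_nonzero_poisson(lam, rng):
    """Draw from Poisson(lam) conditioned on the result being >= 1."""
    while True:
        # Knuth's multiplication method for a single Poisson draw.
        threshold, count, running = math.exp(-lam), 0, rng.random()
        while running > threshold:
            count += 1
            running *= rng.random()
        if count >= 1:  # reject zero draws
            return count

def sample_sentence(rng, gamma=0.85, lam=2.0, n_symbols=5000, n_synonyms=5):
    """Sample a sentence tree; leaves are variable names, inner nodes are tuples."""
    n_vars = sample_nonzero_poisson(lam, rng)
    pool = [f"sym_{i}" for i in rng.sample(range(1, n_symbols + 1), n_vars)]

    def expand(depth):
        # Assumed reading of P(d | d-1) = gamma^d: keep growing with prob gamma^depth.
        if rng.random() >= gamma ** depth:
            return rng.choice(pool)
        op = rng.choice(["and", "or", "not"])
        label = f"{op}_{rng.randint(1, n_synonyms)}"  # pick one operator synonym
        if op == "not":
            return (label, expand(depth + 1))
        return (label, expand(depth + 1), expand(depth + 1))

    return expand(1)

def render(node):
    """Turn a sentence tree into a fully parenthesized string."""
    if isinstance(node, str):
        return node
    name, index = node[0].split("_")
    op = f"{SYMBOLS[name]}_{index}"
    if len(node) == 2:
        return f"({op} {render(node[1])})"
    return f"({render(node[1])} {op} {render(node[2])})"
```

Because γ^d shrinks with depth, expansion stops quickly, which keeps sampled sentences short. Calling `render(sample_sentence(random.Random(0)))` yields one sentence string; sentences for the Truthfulness corpus would additionally be filtered for satisfiability.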

Models and Training
We consider LSTM and Transformer LMs of differing sizes, shown in Table 2. Each model is trained on one of the above four datasets until convergence on the associated validation set using early stopping with a patience of 15 epochs. The LMs were implemented in PyTorch (Paszke et al., 2019) and took roughly 5 hours to converge on TitanV, TitanRTX, and QuadroRTX GPUs. We randomly initialize the embedding layer. Hyperparameter details can be found in the Appendix. We train 5 random restarts of each setting. Due to the regular nature of our synthetic data, we found larger models overfit the training data quickly, and thus focus on smaller models.
Table 2: Summary of language modeling performance. For each model, on each training dataset, we report PPL / %Syn / %Sem where PPL is the perplexity on heldout data (drawn from the same distribution as the training corpus), %Syn is the percentage of generated sentences that are syntactically well formed (i.e., parseable), estimated on a set of 1,000 generations sampled from the trained model, and %Sem is the percentage of generated sentences that are semantically well formed (i.e., satisfiable), estimated on the same set of 1,000.

Results and Discussion
Language Modeling Performance. We first sanity check that the trained models indeed function as LMs before evaluating the lexical representations. We compute the models' perplexity on heldout data. However, because each constraint leads to a differently distributed corpus, perplexity is not comparable across conditions; we therefore also sample 1,000 generated sentences from each model and compare by measuring whether the sentences are 1) syntactically well-formed (i.e., parseable) and 2) semantically well-formed (i.e., satisfiable). Even in the case of models trained with the Syntactic constraint, as seen in Table 2, most of the sentences produced are nonetheless satisfiable. We see no difference between the Syntactic, Truthfulness, and Explicit Grounding conditions on these metrics. (The Informativity numbers are likely higher due to the shorter sentences that result from that generative process.) The fact that models trained only on satisfiable sentences nonetheless generate sentences which do not abide by such constraints suggests the models fail to encode less overt distributional patterns, which depend, for example, on recognizing abstract relations such as "sameness" of symbols in order to recognize violations (e.g., (A ∧ (¬A))). The failure to capture such properties of the data even in this simplified setting might have negative implications for the models' ability to infer abstract semantic relationships from more complex natural language corpora.
Representations of Logical Symbols. Again, our first question is: What constraints on corpus generation yield the greatest amounts of semantic transparency? We quantify this by measuring how well the embeddings learned by the trained LMs correspond to our truth-theoretic notions of semantic equivalence: e.g., are ∧_1 and ∧_2 more similar to one another than ∧_1 and ∨_1? We use a nearest neighbors probing classifier to evaluate whether models distinguish the operators at the lexical level. We run k-fold cross validation, in each iteration choosing one symbol per class (i.e., one ∧, one ∨, one ¬) as the class exemplars, and then classifying the remaining points using cosine similarity. We set k to 125 (5 × 5 × 5, given 5 synonyms per operator), so that we observe every symbol combination as exemplars. We report accuracy averaged across folds and random restarts.
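The probing procedure can be made concrete with a small sketch. It is an illustration under assumptions, not the authors' code: embeddings are passed as plain lists of floats grouped by operator class, ties are broken by class name order, and the function names are hypothetical. Each fold picks one exemplar per class, and every remaining symbol is assigned the class of the exemplar with the highest cosine similarity.

```python
import math
from itertools import product

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def probe_accuracy(embeddings):
    """Nearest-exemplar probe.

    embeddings: dict mapping class name -> list of vectors, one per synonym.
    Returns (mean accuracy across folds, number of folds).
    """
    classes = sorted(embeddings)
    fold_scores = []
    # One fold per combination of exemplars: 5 * 5 * 5 = 125 with 5 synonyms each.
    for exemplar_ids in product(*(range(len(embeddings[c])) for c in classes)):
        exemplars = {c: embeddings[c][i] for c, i in zip(classes, exemplar_ids)}
        correct = total = 0
        for true_class, held_out in zip(classes, exemplar_ids):
            for j, vector in enumerate(embeddings[true_class]):
                if j == held_out:
                    continue  # exemplars are not classified
                predicted = max(classes, key=lambda c: cosine(vector, exemplars[c]))
                correct += predicted == true_class
                total += 1
        fold_scores.append(correct / total)
    return sum(fold_scores) / len(fold_scores), len(fold_scores)
```

With three classes, a probe that separates ¬ from the binary operators but cannot tell ∧ from ∨ will score well below 100%, so accuracy differences in this setup mostly reflect how well ∧ and ∨ are separated.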
Probing classifier results are shown in Figure 1. Figure 2 shows an embedding visualization for one model (Medium Transformer). We find that training on the Syntactic and on the Explicit Grounding dataset leads to the least and the most distinguishable operators respectively for all models, and the other conditions end up between these values.
These results address our first question: there is some difference in semantic transparency between differently constrained datasets. Interestingly, the Transformer models perform better in the Truthfulness condition than in the Syntactic condition, which the LSTMs fail to differentiate. This suggests that, even if it does not necessarily manifest in the models' generations (Table 2), the Transformer architecture may nonetheless be capable of picking up on some of the more abstract distributional patterns via which syntax and semantics are correlated. Further work on larger models would be required to explore this in depth.
In addition, we observe little difference between the quality of the representations learned in the Informativity condition and those learned in the Truthfulness condition; one exception might be in the Medium LSTM, though we cannot confirm that this difference is robustly reproducible. Thus, based on our experiments, there is no evidence that Informativity alone yields greater semantic transparency. However, we note that the experimental setup for Informativity is not directly comparable to the others (e.g., sentences are shorter and less diverse than in Truthfulness) and thus further study would be needed to make strong claims, positive or negative.
Finally, we note that in nearly all cases, models are able to differentiate ¬ from the other operators, likely because it is a unary operator and thus syntactically different from the binary operators. Thus the difference in accuracy is almost entirely due to whether the representations of ∧ and ∨ are differentiated (as shown in Figure 2). This gives a negative answer to our second question concerning whether any constraints are sufficient for an LM to adequately differentiate meaning. Apart from the Small Transformer on the Explicit Grounding condition, none of the models can completely distinguish between symbols that are similar in form but different in meaning.

Related Work
It is an open question whether neural models can learn abstract functions (Marcus, 2001). Our work builds upon a large body of research intended to probe which aspects of language and meaning are being captured by large LMs. Most closely related is work that assesses whether models can perform symbolic reasoning about language (Kassner et al., 2020), e.g., about quantifiers or negation (Talmor et al., 2020; Ettinger, 2020; Kassner and Schütze, 2020; Wang et al., 2018), or that measures the systematicity of models' inferences (Goodwin et al., 2020; Kim and Linzen, 2020; Yanaka et al., 2020; Warstadt et al., 2019). Such work has tended to find that LMs reason primarily contextually as opposed to abstractly. Our evaluation method, which asks whether word embeddings cluster according to their truth-conditional meaning, is related to recent work which defines text-only models as "grounded" if the learned embedding space is isomorphic to the similarity function defined over a ground-truth meaning representation (Merrill et al., 2021). More distantly related is work on LMs' ability to reason about numbers (Wallace et al., 2019) or perform multi-hop reasoning (Yang et al., 2018). Prior work that examines neural networks' ability to perform logical reasoning is superficially related (Evans et al., 2018). Methodologically, our work builds on past work that uses synthetic rather than natural language datasets in order to probe model behavior in the absence of confounds. Notable examples are SCAN for measuring compositionality and generalization (Lake and Baroni, 2018) and Kassner et al. (2020), which investigates LM knowledge acquisition and fact memorization using a synthetic dataset of entity-relation tuples.

Conclusion
Using propositional logic corpora to simulate a controlled language modeling setting, we ask: 1) Do properties of the training corpus affect LMs' abilities to differentiate the meanings of logical operators? and 2) Do any training corpora lead to models that differentiate these meanings to a satisfactory degree? Our results imply a positive answer to (1): Models trained on corpora generated with different constraints appear to perform differently at the task of separating ∧ from ∨. However, these differences are a function of both data and model. For example, the Transformer architecture seems better able to learn from weaker signal (corpora generated only with a Truthfulness constraint), while LSTMs require more explicit signal (direct access to truth values). On question (2), our results are largely negative for the syntactically similar operators. Even the most semantically transparent training data did not enable models to separate the representations of symbols with similar form but different meaning. Only the Small Transformer trained on the Explicit Grounding condition can perfectly differentiate ∧ from ∨ at the lexical level, despite the task's controlled nature. However, every model did separate ¬ from both ∧ and ∨, illustrating how syntactic differences can support differentiation of meaning.
Overall, we contribute a novel framework, based on syntax and semantics of propositional logic, via which we can explore questions of the linguistic capabilities and weaknesses of neural LMs. Our experiments represent a first step in this line of work, but further work is needed to fully appreciate the implications of these results in natural language settings, in particular, how closely the constraints explored here mirror real corpora, and how such learning is influenced by noise and ambiguity found in human language. One specific limitation of our experiments is that we constrain our analysis to the lexical representations: that is, we assume that differences between the meanings of ∧ and ∨ should be encoded in the lexicon, via context-invariant type embeddings. While this assumption is commonplace in formal semantics, neural LMs open the possibility of alternative representations of lexical and compositional semantics. Our results do not rule out the possibility that the relevant semantic distinctions are encoded elsewhere in the model, above the lexical layer. However, we take the combination of the lexical probing results and LM generation results as suggestive but not confirmational evidence of a more general negative finding.

There are several parameters involved in the creation of our synthetic propositional logic datasets:
• Number of sentences in the training set
• Number of unique non-reserved variables (N)
• Number of each operator (K, L, M)
• Sentence depth parameter (γ)
• Poisson distribution parameter for unique non-reserved variables in sentence (λ)
In comparison to dataset sizes for large language models in modern natural language processing, the dataset size (100k training examples) and vocabulary size (5k symbols + 5 of each operator) of our main experimental results (Figure 1) are rather small.
We sought to determine whether our choice of dataset size and non-reserved variable count greatly changed the final results: do our conclusions change based on these parameters? We trained models on different variations of our initial parameters.
First, we swept across training set sizes (20k, 100k, and 500k examples) and number of symbols (500, 5k, 50k) while holding all other parameters constant (γ = .85, λ = 2, K, L, M = 5). We used the Medium Transformer model, which performed best among our four models, and observed the results of the probing classifier on the embeddings after training a separate model on each dataset.
The results of the above sweep are shown in Figure 3. We do not find that the models perform dramatically differently on any of the datasets when dataset size and number of non-reserved symbols are varied.
We also experimented with changing the number of operator synonyms (e.g., ∧_1, ∧_2, ..., ∧_K). We experimented with three different sizes, (K, L, M) = 5, 25, 100, for each of our 4 datasets. Those results are shown in Figure 5, and average frequency is shown in Table 3. We found that adding additional synonyms of each operator hurt performance, likely because adding additional synonyms of ∧ and ∨ made generalization more challenging, causing the models' performance to drop.
In a set of earlier experiments, to choose the sentence depth (γ) and Poisson distribution (λ) parameters, we ran a hyperparameter search on the Explicit Grounding condition across three values of each (nine datasets in total). Specifically, we tested λ = 2, 3, 5 and γ = .7, .8, .85. We then trained the transformer model once on each of the nine datasets, and the results are shown in Figure 6. We chose λ = 2 and γ = .85.

Informativity dataset information
We tested different settings of |T| (number of target worlds) and |A| (number of alternative worlds).
The shortest sentence would then be sym_1, as it sufficiently distinguishes T from A. However, with |T| = 1, |A| = 2, we might generate T = (sym_1 = T, sym_2 = F), A = ((sym_1 = F, sym_2 = F), (sym_1 = T, sym_2 = T)). Now the shortest sentence that can be generated is (sym_1 ∧_1 ¬_1 (sym_2)). The settings |T| = 1, |A| = 2 and |T| = 2, |A| = 1 result in sentences that are both short and structurally nearly identical, although inverted. This is due to the truth conditions allowed by each operator. We generate the datasets for each combination and report the results in Table 4. We excluded these datasets because of the simplicity and similarity of the sentences. We found that |T| = 2, |A| = 2 allows for sentences that are much more varied.
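The search for the shortest distinguishing sentence can be sketched as an enumeration of formulas in order of increasing size. This is an illustrative sketch rather than the authors' generator: it assumes that sentence length is measured as the number of nodes in the formula tree, ignores operator synonyms, and uses hypothetical function names. Formulas with the same truth values across T and A are treated as interchangeable, so only the first (smallest) one is kept.

```python
from itertools import product

def shortest_sentence(targets, alternatives, max_size=9):
    """Return a smallest formula true in every target world and false in every
    alternative world, or None if none exists up to max_size nodes.

    Worlds are dicts mapping variable name -> bool. Formulas are variable
    names (str) or tuples ("not", f), ("and", f, g), ("or", f, g).
    """
    worlds = list(targets) + list(alternatives)
    goal = tuple([True] * len(targets) + [False] * len(alternatives))
    variables = sorted({name for world in worlds for name in world})
    by_size = {}   # size -> list of (formula, truth values across worlds)
    seen = set()   # truth-value signatures already covered by a smaller formula
    for size in range(1, max_size + 1):
        candidates = []
        if size == 1:
            for name in variables:
                candidates.append((name, tuple(w[name] for w in worlds)))
        else:
            # Negation adds one node to a formula of size - 1.
            for f, sig in by_size.get(size - 1, []):
                candidates.append((("not", f), tuple(not b for b in sig)))
            # A binary operator adds one node on top of two subformulas.
            for left_size in range(1, size - 1):
                right_size = size - 1 - left_size
                pairs = product(by_size.get(left_size, []), by_size.get(right_size, []))
                for (f, sf), (g, sg) in pairs:
                    candidates.append((("and", f, g), tuple(a and b for a, b in zip(sf, sg))))
                    candidates.append((("or", f, g), tuple(a or b for a, b in zip(sf, sg))))
        kept = []
        for formula, sig in candidates:
            if sig == goal:
                return formula
            if sig not in seen:
                seen.add(sig)
                kept.append((formula, sig))
        by_size[size] = kept
    return None
```

On the |T| = 1, |A| = 2 example above, this search returns a conjunction of sym_1 with the negation of sym_2, matching the sentence given in the text; with a single alternative world in which only sym_1 differs (a hypothetical input), it returns the bare variable sym_1.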