SyGNS: A Systematic Generalization Testbed Based on Natural Language Semantics

Recently, deep neural networks (DNNs) have achieved great success in semantically challenging NLP tasks, yet it remains unclear whether DNN models can capture compositional meanings, those aspects of meaning that have long been studied in formal semantics. To investigate this issue, we propose a Systematic Generalization testbed based on Natural language Semantics (SyGNS), whose task is to map natural language sentences to multiple forms of scoped meaning representations designed to account for various semantic phenomena. Using SyGNS, we test whether neural networks can systematically parse sentences involving novel combinations of logical expressions such as quantifiers and negation. Experiments show that Transformer and GRU models can generalize to unseen combinations of quantifiers, negations, and modifiers that are similar in form to given training instances, but not to others. We also find that generalization performance on unseen combinations is better when the form of the meaning representation is simpler. The data and code for SyGNS are publicly available at https://github.com/verypluming/SyGNS.


Introduction
Deep neural networks (DNNs) have shown impressive performance in various language understanding tasks (Wang et al., 2019a,b, i.a.), including semantically challenging tasks such as Natural Language Inference (NLI; Dagan et al., 2013; Bowman et al., 2015). However, a number of studies probing DNN models with various NLI datasets (Naik et al., 2018; Dasgupta et al., 2018; Yanaka et al., 2019; Kim et al., 2019; Richardson et al., 2020; Saha et al., 2020; Geiger et al., 2020) have reported that current DNN models have limitations in generalizing to diverse semantic phenomena, and it is still not clear whether DNN models obtain the ability to capture compositional aspects of meaning in natural language.

Figure 1: Illustration of our evaluation protocol using SyGNS. The goal is to map English sentences to meaning representations. The generalization test evaluates novel combinations of operations (modifier, quantifier, negation) in the training set. Training sentences: "One wild dog ran", "All dogs ran", "One dog did not run"; generalization test: "All wild dogs ran". We use multiple meaning representations and evaluation methods.
There are two issues to consider here. First, recent analyses (Talmor and Berant, 2019; Liu et al., 2019; McCoy et al., 2019) have pointed out that the standard evaluation paradigm, where a test set is drawn from the same distribution as the training set, does not always indicate that a model has obtained the intended generalization ability for language understanding. Second, the NLI task of predicting the relationship between a premise sentence and an associated hypothesis, without asking for their semantic interpretations, tends to be black-box, in that it is often difficult to isolate the reasons why models make incorrect predictions (Bos, 2008).
To address these issues, we propose SyGNS (pronounced as signs), a Systematic Generalization testbed based on Natural language Semantics. The goal is to map English sentences to various meaning representations, so it can be taken as a sequence-to-sequence semantic parsing task. Figure 1 illustrates our evaluation protocol using SyGNS. To address the first issue above, we probe the generalization capability of DNN models on two out-of-distribution tests: systematicity (Section 3.1) and productivity (Section 3.2), two concepts treated as hallmarks of human cognitive capacities in the cognitive sciences (Fodor and Pylyshyn, 1988; Calvo and Symons, 2014). We use a train-test split controlled by each target concept and train models with a minimally sized training set (Basic set) involving primitive patterns composed of semantic phenomena such as quantifiers, modifiers, and negation. If a model learns the distinct properties of each semantic phenomenon from the Basic set, it should be able to parse a sentence with novel combination patterns. Otherwise, a model has to memorize an exponential number of combinations of linguistic expressions.
To address the second issue, we use multiple forms of meaning representations developed in formal semantics (Montague, 1973;Heim and Kratzer, 1998;Jacobson, 2014) and their respective evaluation methods. We use three scoped meaning representation forms, each of which preserves the same semantic information (Section 3.3). In formal semantics, it is generally assumed that scoped meaning representations are standard forms for handling diverse semantic phenomena such as quantification and negation. Scoped meaning representations also enable us to evaluate the compositional generalization ability of the models to capture semantic phenomena in a more fine-grained way. By decomposing an output meaning representation into constituents (e.g., words) in accordance with its structure, we can compute the matching ratio between the output representation and the gold standard representation. Evaluating the models on multiple meaning representation forms also allows us to explore whether the performance depends on the complexity of the representation forms.
This paper provides three main contributions. First, we develop the SyGNS testbed to test model ability to systematically transform sentences involving linguistic phenomena into multiple forms of scoped meaning representations. The data and code for SyGNS are publicly available at https://github.com/verypluming/SyGNS. Second, we use SyGNS to analyze the systematic generalization capacity of two standard DNN models: Gated Recurrent Unit (GRU) and Transformer. Experiments show that these models can generalize to unseen combinations of quantifiers, negations, and modifiers to some extent. However, the generalization ability is limited to combinations whose forms are similar to those of the training instances. In addition, the models struggle with parsing sentences involving nested clauses. We also show that the extent of generalization depends on the choice of primitive patterns and representation forms.

Related Work
The question of whether neural networks obtain systematic generalization capacity has long been discussed (Fodor and Pylyshyn, 1988; Marcus, 2003; Baroni, 2020). Recently, empirical studies using NLI tasks have revisited this question, showing that current models learn undesired biases (Glockner et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Geva et al., 2019; Liu et al., 2019) and heuristics (McCoy et al., 2019), and fail to consistently learn various inference types (Rozen et al., 2019; Nie et al., 2019; Yanaka et al., 2019; Richardson et al., 2020; Joshi et al., 2020). In particular, previous work (Goodwin et al., 2020; Yanaka et al., 2020; Geiger et al., 2020; Yanaka et al., 2021) has examined whether models learn the systematicity of NLI on monotonicity and veridicality. While this line of work has shown certain limitations of model generalization capacity, it is often difficult to figure out why an NLI model fails and how to improve it, partly because NLI tasks depend on multiple factors, including the semantic interpretation of target phenomena and the acquisition of background knowledge. By focusing on semantic parsing rather than NLI, one can probe to what extent models systematically interpret the meaning of sentences according to their structures and the meanings of their constituents.
Meanwhile, datasets for analyzing the compositional generalization ability of DNN models in semantic parsing have been proposed, including SCAN (Lake and Baroni, 2017; Baroni, 2020), CLUTRR (Sinha et al., 2019), and CFQ (Keysers et al., 2020). For example, the SCAN task investigates whether models trained on a set of primitive instructions (e.g., jump → JUMP) and modifiers (e.g., walk twice → WALK WALK) generalize to new combinations of primitives (e.g., jump twice → JUMP JUMP). However, these datasets deal with artificial languages, where the variation of linguistic expressions is limited, so it is not clear to what extent models systematically interpret various semantic phenomena in natural language, such as quantification and negation.
Regarding the generalization capacity of DNN models in natural language, previous studies have focused on syntactic and morphological generalization capacities such as subject-verb agreement tasks (Linzen et al., 2016;Gulordava et al., 2018;Marvin and Linzen, 2018, i.a.). Perhaps closest to our work is the COGS task (Kim and Linzen, 2020) for probing the generalization capacity of semantic parsing in a synthetic natural language fragment. For instance, the task is to see whether models trained to parse sentences where some lexical items only appear in subject position (e.g., John ate the meat) can generalize to structurally related sentences where these items appear in object position (e.g., The kid liked John). In contrast to this work, our focus is more on semantic parsing of sentences with logical and semantic phenomena that require scoped meaning representations. Our study also improves previous work on the compositional generalization capacity in semantic parsing in that we compare three types of meaning representations and evaluate them at multiple levels, including logical entailment, polarity assignment, and partial clause matching (Section 3.3).

Overview of SyGNS
We use two evaluation concepts to assess the systematic generalization capability of models: systematicity (Section 3.1) and productivity (Section 3.2). In evaluating these two concepts, we use synthesized pairs of sentences and their meaning representations to control the train-test split (Section 3.4). The main idea is to analyze models trained with a minimally sized training set (Basic set) involving primitive patterns composed of various semantic phenomena; if a model systematically learns the primitive combination patterns in the Basic set, it should parse a new sentence with different combination patterns. We target three types of scoped meaning representations and use their respective evaluation methods, according to the function and structure of each representation form (Section 3.3).

Systematicity

Table 1 illustrates how we test systematicity, i.e., the capacity to interpret novel combinations of primitive semantic phenomena. We generate Basic set 1 by combining various quantifiers with sentences without modifiers. We also generate Basic set 2 by setting an arbitrary quantifier (e.g., one) as the primitive quantifier and combining it with various types of modifiers. We then evaluate whether models trained with Basic sets 1 and 2 can parse sentences involving unseen combinations of quantifiers and modifiers. We also test the combination of quantifiers and negation in the same manner; the details are given in Appendix D.

Table 1: Train-test split for systematicity (Pattern, Sentence).
  Train (Basic 1):  Primitive quantifier  One tiger ran
                    EXI                   A tiger ran
                    NUM                   Two tigers ran
                    UNI                   Every tiger ran
  Train (Basic 2):  ADJ                   One small tiger ran
                    ADV                   One tiger ran quickly
                    CON                   One tiger ran or came
  Test:             EXI+ADJ               A small tiger ran
                    NUM+ADV               Two tigers ran quickly
                    UNI+CON               Every tiger ran or came
To provide a controlled setup, we use three quantifier types: existential quantifiers (EXI), numerals (NUM), and universal quantifiers (UNI). Each type has two patterns: one and a for EXI, two and three for NUM, and all and every for UNI. We consider three settings where the primitive quantifier is set to the type EXI, NUM, or UNI.
For modifiers, we distinguish three types, adjectives (ADJ), adverbs (ADV), and logical connectives (CON; conjunction and, disjunction or), with ten patterns for each. Note that each modifier type differs in its position: an adjective appears inside a noun phrase (e.g., one small tiger), while an adverb (e.g., quickly) and a coordinated phrase with a logical connective (e.g., or came) appear at the end of a sentence. Although Table 1 only shows the patterns generated by the primitive quantifier one and the noun tiger, the noun can be replaced with ten other nouns (e.g., dog, cat, etc.) in each setting. See Appendix A for more details on the fragment of English considered here.
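To make the protocol concrete, the following sketch enumerates the systematicity split described above. The function and lexical items are our illustration, not the SyGNS generator itself, and number agreement between quantifier and noun is omitted for brevity.

```python
# Hypothetical illustration of the systematicity train-test split.
QUANTIFIERS = {"EXI": ["one", "a"], "NUM": ["two", "three"], "UNI": ["all", "every"]}
MODIFIERS = {"ADJ": "Q small NOUN ran", "ADV": "Q NOUN ran quickly", "CON": "Q NOUN ran or came"}

def systematicity_split(primitive="one"):
    """Basic set 1: every quantifier, no modifier.
    Basic set 2: the primitive quantifier with every modifier type.
    Test: all remaining quantifier-modifier combinations."""
    basic1 = [f"{q} tiger ran" for qs in QUANTIFIERS.values() for q in qs]
    basic2 = [tmpl.replace("Q", primitive).replace("NOUN", "tiger")
              for tmpl in MODIFIERS.values()]
    test = [tmpl.replace("Q", q).replace("NOUN", "tiger")
            for qs in QUANTIFIERS.values() for q in qs if q != primitive
            for tmpl in MODIFIERS.values()]
    return basic1, basic2, test
```

With six quantifiers and three modifier types, the test set covers the 5 × 3 = 15 quantifier-modifier combinations the model never saw during training.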

Productivity
Productivity refers to the capacity to interpret an indefinite number of sentences with recursive operations. To test productivity, we use embedded relative clauses, which interact with quantifiers to generate logically complex sentences. Table 2 shows examples. We provide two Basic sets: Basic set 1 consists of sentences without embedded clauses (NON), and Basic set 2 consists of sentences with a single embedded clause, which we call sentences with depth one. We then test whether models trained with Basic sets 1 and 2 can parse sentences involving more deeply embedded clauses, i.e., sentences whose depth is two or more. As Table 2 shows, we consider both peripheral-embedding (PER) and center-embedding (CEN) clauses.
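To illustrate the recursion involved, here is a minimal sketch of generating subject-embedded relative clauses up to a given depth. The lexical items and function names are hypothetical, only one embedding pattern is shown, and this is not the paper's generator.

```python
QS = ["all", "two", "three"]
NOUNS = ["lions", "bears", "monkeys"]
TVS = ["followed", "chased"]

def np(depth, i=0):
    """Noun phrase with `depth` stacked relative clauses."""
    head = f"{QS[i % 3]} {NOUNS[i % 3]}"
    if depth == 0:
        return head
    return f"{head} that {TVS[i % 2]} {np(depth - 1, i + 1)}"

def sentence(depth):
    # the subject carries the embedding, so at higher depths the final
    # verb ends up far from its subject (a center-embedding-like pattern)
    return f"{np(depth)} ran"
```

For example, `sentence(2)` yields a depth-two sentence of the same shape as the paper's "all lions that ... two bears that chased three monkeys ..." examples.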

Meaning representation and evaluation
Overview To evaluate generalization capacity in semantic parsing at multiple levels, we use three types of scoped meaning representations: (i) First-Order Logic (FOL) formulas, (ii) Discourse Representation Structures (DRSs; Kamp and Reyle, 1993), and (iii) Variable-Free (VF) formulas (Baader et al., 2003; Pratt-Hartmann and Moss, 2009). DRSs can be converted to clausal forms (van Noord et al., 2018a) for evaluation. For instance, sentence (1) is mapped to the FOL formula in (2), the DRS in (3a), its clausal form in (3b), and the VF formula in (4).
(1) One white dog did not run.
(2) ∃x.(white(x) ∧ dog(x) ∧ ¬run(x))
(4) EXIST AND WHITE DOG NOT RUN

Using these multiple forms enables us to analyze whether the difficulty of semantic generalization depends on the format of the meaning representation.
Previous studies probing generalization capacity in semantic parsing (e.g., Lake and Baroni, 2017; Sinha et al., 2019; Keysers et al., 2020; Kim and Linzen, 2020) use a fixed type of meaning representation, with its evaluation method limited to the exact-match percentage, where an output is considered correct only if it exactly matches the gold standard. However, this does not properly assess whether models capture the structure and function of meaning representations. First, exact matching does not directly take into account whether two meanings are logically equivalent (Blackburn and Bos, 2005): for instance, two formulas A ∧ B and B ∧ A differ in form but have the same meaning. Relatedly, scoped meaning representations for natural language can be complex, involving parentheses and variable renaming mechanisms (the so-called α-conversion in the λ-calculus). For instance, we want to identify two formulas that differ only in variable naming, e.g., ∃x1.F(x1) and ∃x2.F(x2). It is therefore desirable to compare exact matching with alternative evaluation methods, and to consider alternative meaning representations that avoid these problems. With this background in mind, we describe each type of meaning representation in detail below.
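As an illustration of the variable-naming problem, a canonicalization pass can identify alphabetic variants before exact matching. This sketch assumes variables match the pattern `x<digits>` and is not scope-aware (it would mishandle names reused across binders); it is our illustration, not part of the SyGNS tooling.

```python
import re

def canonicalize(formula):
    """Rename variables to v1, v2, ... in order of first occurrence, so
    that formulas differing only in variable naming compare equal."""
    mapping = {}
    def rename(m):
        v = m.group(0)
        if v not in mapping:
            mapping[v] = f"v{len(mapping) + 1}"
        return mapping[v]
    return re.sub(r"x\d+", rename, formula)
```

After canonicalization, `exists x2.(dog(x2) & run(x2))` and `exists x7.(dog(x7) & run(x7))` map to the same string.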
FOL formula FOL formulas are standard forms in formal and computational semantics (Blackburn and Bos, 2005;Jurafsky and Martin, 2009), where content words such as nouns and verbs are represented as predicates, and function words such as quantifiers, negation, and connectives are represented as logical operators with scope relations (cf. the example in (2)). To address the issue on evaluation, we consider two ways of evaluating FOL formulas in addition to exact matching: (i) automated theorem proving (ATP) and (ii) monotonicity-based polarity assignment.
First, FOL formulas can be evaluated by checking the logical entailment relationships that directly consider whether two formulas are logically equivalent. Thus we evaluate predicted FOL formulas by using ATP. We check whether a gold formula G entails prediction P and vice versa, using an off-the-shelf FOL theorem prover. To see the logical relationship between G and P, we measure the accuracy for unidirectional and bidirectional entailment: G ⇒ P, G ⇐ P, and G ⇔ P.
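A full FOL prover is beyond a short example, but the idea of checking G ⇒ P can be approximated by brute-force model checking over small finite domains. This is only a necessary condition for entailment (a counter-model may require a larger domain), and the tuple encoding of formulas is our own illustration, not the prover the paper uses.

```python
from itertools import combinations, product

def powerset(xs):
    """All subsets of xs, as sets."""
    return [set(c) for r in range(len(xs) + 1) for c in combinations(xs, r)]

def holds(tree, model, g):
    """Evaluate a formula tree in a finite model under assignment g.
    Formulas are nested tuples over a small fragment with unary
    predicates: ("pred", name, var), ("not", f), ("and", f1, f2, ...),
    ("imp", f1, f2), ("exists", var, f), ("forall", var, f)."""
    op = tree[0]
    if op == "pred":
        return g[tree[2]] in model[tree[1]]
    if op == "not":
        return not holds(tree[1], model, g)
    if op == "and":
        return all(holds(s, model, g) for s in tree[1:])
    if op == "imp":
        return (not holds(tree[1], model, g)) or holds(tree[2], model, g)
    if op == "exists":
        return any(holds(tree[2], model, {**g, tree[1]: e}) for e in model["domain"])
    if op == "forall":
        return all(holds(tree[2], model, {**g, tree[1]: e}) for e in model["domain"])
    raise ValueError(f"unknown operator: {op}")

def entails(gold, pred, preds, domain_size=2):
    """Check gold => pred over every model on a small domain."""
    domain = list(range(domain_size))
    for exts in product(powerset(domain), repeat=len(preds)):
        model = dict(zip(preds, exts))
        model["domain"] = domain
        if holds(gold, model, {}) and not holds(pred, model, {}):
            return False  # counter-model found
    return True
```

For instance, ∃x.(dog(x) ∧ white(x) ∧ run(x)) entails ∃x.(dog(x) ∧ run(x)) but not vice versa, and the converse direction is refuted by a one-element counter-model.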
Second, the polarity of each content word appearing in a sentence can be extracted from the FOL formula using its monotonicity property (van Benthem, 1986; MacCartney and Manning, 2007). This enables us to analyze whether models can correctly capture entailment relations triggered by quantifier and negation scopes. Table 3 shows some examples of monotonicity-based polarity assignments. For example, existential quantifiers such as one are upward monotone (shown as ↑) with respect to the subject NP and the VP, because these can be substituted with their hypernyms (e.g., One dog ran ⇒ One animal moved). These polarities can be extracted from the FOL formula because ∃ and ∧ are upward monotone operators in FOL. Universal quantifiers such as all are downward monotone (shown as ↓) with respect to the subject NP and upward monotone with respect to the VP. Expressions in downward monotone position can be substituted with hyponymous expressions (e.g., All dogs ran ⇒ All white dogs ran). A polarity can be reversed by embedding in another downward entailing context such as negation, so the polarity of run in the third case in Table 3 is flipped to downward monotone. For the evaluation based on monotonicity, we extract a polarity for each content word in the gold formula and the prediction and calculate the F-score for each monotonicity direction (upward and downward).
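The polarity computation can be sketched as a tree traversal that flips polarity under negation and in the restrictor of a universal quantifier. The tuple encoding and function name are our illustration, not the paper's implementation.

```python
def polarities(tree, pol=1, out=None):
    """Assign a monotonicity polarity (+1 upward, -1 downward) to each
    predicate in a formula given as nested tuples. Polarity flips under
    "not" and in the restrictor (first argument) of "forall"; "exists",
    "and", and "or" preserve it."""
    if out is None:
        out = {}
    op = tree[0]
    if op == "pred":
        out[tree[1]] = pol
    elif op == "not":
        polarities(tree[1], -pol, out)
    elif op in ("and", "or", "exists"):
        for sub in tree[1:]:
            polarities(sub, pol, out)
    elif op == "forall":
        polarities(tree[1], -pol, out)  # restrictor: downward entailing
        polarities(tree[2], pol, out)   # body: upward entailing
    else:
        raise ValueError(tree)
    return out
```

On a formula for "All wild dogs did not run", every predicate comes out downward monotone, matching the flipped polarity of run described above.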
DRS A DRS is a form of scoped meaning representation proposed in Discourse Representation Theory, a well-studied formalism in formal semantics (Kamp and Reyle, 1993; Asher, 1993; Muskens, 1996). By translating a box notation as in (3a) to the clausal form as in (3b), one can evaluate DRSs with COUNTER, a standard tool for evaluating neural DRS parsers (Liu et al., 2018; van Noord et al., 2018b). COUNTER searches for the best variable mapping between predicted DRS clauses and gold DRS clauses and calculates an F-score over matching clauses, similar to SMATCH (Cai and Knight, 2013), an evaluation metric designed for Abstract Meaning Representation (AMR; Banarescu et al., 2013). COUNTER thereby sidesteps the variable renaming problem and correctly evaluates cases where the order of clauses differs from that of the gold answer.
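The core of this style of evaluation, searching for the variable mapping that maximizes matching clauses, can be sketched as follows. This brute-force version is our simplified illustration: clauses are plain tuples, variables are tokens starting with "x", and COUNTER itself uses a more efficient search and a richer clause format.

```python
from itertools import permutations

def clause_f1(gold, pred):
    """Best-mapping F-score over clauses: try every mapping from
    predicted variables to gold variables and keep the one that
    maximizes the number of matching clauses."""
    gold_set = set(gold)
    gvars = sorted({t for c in gold for t in c if t.startswith("x")})
    pvars = sorted({t for c in pred for t in c if t.startswith("x")})
    # pad targets so every predicted variable can map somewhere
    targets = gvars + [f"fresh{i}" for i in range(max(0, len(pvars) - len(gvars)))]
    best = 0
    for perm in permutations(targets, len(pvars)):
        m = dict(zip(pvars, perm))
        mapped = {tuple(m.get(t, t) for t in c) for c in pred}
        best = max(best, len(mapped & gold_set))
    p = best / len(pred) if pred else 0.0
    r = best / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

Because the score is computed under the best mapping, a prediction that only renames variables still receives full credit, and partially correct clause sets receive partial credit.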
VF formula FOL formulas in our fragment have logically equivalent forms in a variable-free format, which contains neither parentheses nor variables, as in example (4). Our format is similar to variable-free forms in Description Logic (Baader et al., 2003) and Natural Logic (Pratt-Hartmann and Moss, 2009). VF formulas avoid the problems of parentheses and variable renaming while preserving semantic information (cf. Wang et al., 2017). Owing to the equivalence with FOL formulas, it is also possible to extract polarities from VF formulas. See Appendix A for more examples of VF formulas.

Data generation
To provide synthesized data, we generate sentences using a context-free grammar (CFG) associated with semantic composition rules in the standard λ-calculus (see Appendix A for details). Each sentence is mapped to an FOL formula and a VF formula by using the semantic composition rules specified in the CFG. DRSs are converted from the generated FOL formulas using the standard mapping (Kamp and Reyle, 1993).
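A toy version of grammar-driven semantic composition might look as follows. The grammar, lexicon, and output format are deliberately simplified illustrations of the idea, not the rules in Appendix A.

```python
# Each lexical entry pairs a word with a small semantic function.
def exist(restr, body):
    return f"exists x.({restr('x')} & {body('x')})"

def univ(restr, body):
    return f"all x.({restr('x')} -> {body('x')})"

LEX = {
    "one": exist, "a": exist, "every": univ, "all": univ,
    "dog": lambda x: f"dog({x})", "white": lambda x: f"white({x})",
    "ran": lambda x: f"run({x})",
}

def parse(tokens):
    """[Quantifier, (Adj,) Noun, (did not,) Verb] -> FOL string."""
    q, rest = LEX[tokens[0]], list(tokens[1:])
    mods = []
    while rest[0] in ("white",):   # toy adjective list
        mods.append(LEX[rest.pop(0)])
    noun = LEX[rest.pop(0)]
    neg = rest[:2] == ["did", "not"]
    if neg:
        rest = rest[2:]
    # map the bare form "run" back to the lexical entry for "ran"
    verb = LEX[rest[0]] if rest[0] in LEX else LEX["ran"]
    restr = lambda x: " & ".join([m(x) for m in mods] + [noun(x)])
    body = (lambda x: f"-{verb(x)}") if neg else verb
    return q(restr, body)
```

Composing "one white dog did not run" this way yields an existential formula with a negated body, mirroring example (2).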

Experiments and Analysis
Using SyGNS, we test the performance of Gated Recurrent Unit (GRU; Cho et al., 2014) and Transformer (Vaswani et al., 2017) models in an encoder-decoder setup. These widely used models are known to perform well on hierarchical generalization tasks (McCoy et al., 2018; Russin et al., 2020).

Table 4: Accuracy by quantifier type. "DRS (cnt)" columns show the accuracy of predicted DRSs by COUNTER, and "Valid" rows show the validation accuracy. Each accuracy is measured by exact matching, except for "DRS (cnt)" columns.

Experimental setup
In all experiments, we trained each model for 25 epochs with early stopping (patience = 3). We performed five runs and report their average accuracies. The input sentence is represented as a sequence of words, using spaces as separators. The maximum input and output sequence length is set to the length of a sequence with the maximum depth of embedded clauses. We set the dropout probability to 0.1 on the output and used a batch size of 128 and an embedding size of 256. Since incorporating pre-training would make it hard to distinguish whether the models' ability to perform semantic parsing comes from the training data or from pre-training data, we did not use any pre-training. For the GRU, we used a single-layer encoder-decoder with global attention and a dot-product score function. Since previous work (Kim and Linzen, 2020) reported that unidirectional models are more robust to sentence structure than bidirectional models, we selected a unidirectional GRU encoder. For the Transformer, we used a three-layer encoder-decoder, a model size of 512, and a hidden size of 256. Each model had about 10M parameters. See Appendix B for additional training details.

Results on systematicity
Generalization on quantifiers Table 4 shows the accuracy by quantifier type. When the existential quantifier one was the primitive quantifier, the accuracy on problems involving existential quantifiers, which have the same type as the primitive quantifier, was nearly perfect. Similarly, when the universal quantifier every was the primitive quantifier, the accuracy on problems involving universal quantifiers was much better than on problems involving other quantifier types. These results indicate that models can easily generalize to problems involving quantifiers of the same type as the primitive quantifier, while they struggle with generalizing to the others. We also experimented with larger models and observed the same trend (see Appendix C).

The extent of generalization varies with the primitive quantifier type and the meaning representation form. For example, when the primitive quantifier is the numeral expression two, models generalize to problems of VF formulas involving universal quantifiers. This can be explained by the fact that VF formulas involving universal quantifiers, as in (5b), have a similar form to those involving numerals, as in (6b), whereas FOL formulas involving universal quantifiers have a different form from those involving numerals, as in (5c).

We also check the performance when the three quantifiers one, two, and every are all set as primitive quantifiers. This setting is easier than that for systematicity in Table 1, since the models are exposed to combination patterns of all quantifier types and all modifier types. In this setting, the models achieved almost perfect performance on the test set involving the non-primitive quantifiers (a, three, all).

Table 5 shows the accuracy by modifier type where one is set as the primitive quantifier (see Appendix C for the results where other quantifier types are set as the primitive quantifier).
No matter which quantifier is set as the primitive quantifier, the accuracy on problems involving logical connectives or adverbs is better than on those involving adjectives. As in (8), an adjective is placed between a quantifier and a noun, so the position of the noun dog relative to the quantifier every in the test set changes from the example in the training (Basic) set in (7). In contrast, adverbs and logical connectives are placed at the end of a sentence, so the position of the noun does not change from the training set, as in (9). This suggests that models can more easily generalize to problems involving unseen combinations of quantifiers and modifiers when the position of the noun is the same between the training and test sets.

Meaning representation comparison Comparing forms of meaning representations, accuracy by exact matching is highest for VF formulas, followed by DRS clausal forms and FOL formulas. This indicates that models can more easily generalize to unseen combinations when the form of the meaning representation is simpler: VF formulas contain neither parentheses nor variables, DRS clausal forms contain variables but not parentheses, and FOL formulas contain both.

Model comparison
Regarding the generalization capacity of the models for decoding meaning representations, the left two figures in Figure 2 show learning curves on FOL prediction tasks by quantifier type. While GRU achieved perfect performance on the same quantifier type as the primitive quantifier with 2,500 training examples, Transformer reached the same performance only with 8,000 training examples. The right two figures in Figure 2 show learning curves by modifier type. The GRU accuracy is unstable even when the number of training examples is maximal. In contrast, the Transformer accuracy is stable once the number of training examples exceeds 8,000. These results indicate that GRU generalizes to unseen combinations of quantifiers and modifiers with a smaller training set than Transformer does, while the Transformer performance is more stable than that of GRU.

Table 6 shows the ATP-based evaluation results on FOL formulas. For combinations involving numerals, both GRU and Transformer achieve high accuracies for G ⇒ P entailments but low accuracies for G ⇐ P entailments. Since both models fail to output the formulas corresponding to modifiers, they fail to prove G ⇐ P entailments. Regarding combinations involving universal quantifiers, the GRU accuracy for both G ⇒ P and G ⇐ P entailments is low, and the Transformer accuracy for G ⇐ P entailments is much higher than that for G ⇒ P entailments. As shown in Table 7, GRU tends to fail to output the formula for a modifier, e.g., wild(x) in this case, while Transformer fails to correctly output the position of the implication (→). The ATP-based evaluation results reflect such differences between the error trends of the models on problems involving different forms of quantifiers.

Table 5: Accuracy by modifier type (primitive quantifier: existential quantifier one). +NEG indicates problems involving negation. Each accuracy is measured by exact matching, except for "DRS (cnt)" columns.
Table 8 shows accuracies for the monotonicity-based polarity assignment evaluation on FOL formulas. These accuracies are higher than those using exact matching (cf. Table 4). Monotonicity-based evaluation captures the polarities assigned to content words even for problems that exact matching judges as incorrect because of differences in form. Table 7 shows examples of predicted polarity assignments. Here both models predicted correct polarities for three content words, cat ↓, escape ↑, run ↑. Exact matching cannot take such partial matches into account. The downward monotone accuracies for problems involving universal quantifiers are low (40.7 and 73.4 in Table 8). In Table 7, both models failed to predict the downward monotonicity of wild ↓. These results indicate that both models struggle with capturing the scope of universal quantifiers. Appendix C shows the evaluation results on the polarities of VF formulas.

Results on productivity

Table 9 shows very low generalization accuracy for both GRU and Transformer at unseen depths. Although the evaluation results using COUNTER on DRS prediction tasks are much higher than those by exact matching, this is because COUNTER uses partial matching; both models tended to correctly predict the clauses in the subject NP positioned at the beginning of the sentence (see Appendix E for details). We also checked whether models can generalize to unseen combinations involving embedded clauses when the models are exposed to a part of the training instances at each depth. We provide Basic set 1 involving non-embedding patterns like (10), where Q can be replaced with any quantifier. This Basic set 1 exposes models to all quantifier patterns.

Table 9: Accuracy for productivity. "Dep" rows show embedding depths, "DRS (cnt)" columns show the accuracy of predicted DRSs by COUNTER, and the "Valid" row shows the validation accuracy. Each accuracy is measured by exact matching, except for "DRS (cnt)" columns.
We also expose models to Basic set 2 involving three primitive quantifiers (one, two, and every) at each embedding depth, as in (11) and (12). We provide around 2,000 training instances at each depth. We then test models on a test set involving the other quantifiers (a, three, and all) at each embedding depth, as in (13) and (14). If models can distinguish quantifier types during training, they can correctly compose meaning representations involving different combinations of multiple quantifiers. Note that this setting is easier than that for productivity in Table 2, in that models are exposed to some instances at each depth.

Table 10 shows that both models partially generalize to cases where the depth is 1 or 2. However, both models fail to generalize to cases where the depth is 3 or more. This suggests that even if models are trained with some instances at each depth, they fail to learn distinctions between different quantifier types and struggle with parsing sentences whose embedding depth is 3 or more.

Conclusion
We have introduced an analysis method using SyGNS, a testbed for diagnosing the systematic generalization capability of DNN models in semantic parsing. We found that GRU and Transformer generalized to unseen combinations of semantic phenomena whose meaning representations are similar in form to those in the training set, while the models struggled with generalizing to the others. In addition, these models failed to generalize to cases involving nested clauses. Our analyses using multiple meaning representations and evaluation methods also revealed detailed behaviors of the models. We believe that SyGNS serves as an effective testbed for investigating the ability to capture compositional meanings in natural language.

A Fragment of English

Table 12 shows the set of context-free grammar rules and semantic composition rules we use to generate a fragment of English annotated with meaning representations in the SyGNS dataset. Each grammar rule is associated with two kinds of semantic composition rules formulated in λ-calculus.
One is for deriving first-order logic (FOL) formulas, and the other is for deriving variable-free (VF) formulas. For FOL, semantic composition runs in the standard Montagovian fashion where all NPs (including proper nouns) are treated as generalized quantifiers (Heim and Kratzer, 1998;Jacobson, 2014). From FOL formulas, we can extract the polarity of each content word using the monotonicity calculus (Van Eijck, 2005). Table 13 shows some examples of polarized FOL formulas. The derivation of VF formulas runs in two steps. To begin with, a sentence is mapped to a variable-free form by semantic composition rules. For instance, the sentence a small dog did not swim is mapped to a variable-free formula EXIST(AND(SMALL,DOG),NOT(SWIM)) by the rules in Table 12. Second, since this form is in prefix notation, all brackets can be eliminated without causing ambiguity. This produces the resulting VF formula EXIST AND SMALL DOG NOT SWIM.
Some other examples are shown in Table 13. DRSs are converted from FOL formulas in the standard way (Kamp and Reyle, 1993).
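The bracket-elimination step works because every operator in the fragment has a fixed arity, so the bracket-free prefix string is unambiguous. A minimal sketch, with a tuple tree encoding of our own choosing:

```python
def to_vf(tree):
    """Flatten a prefix-notation formula tree into a variable-free (VF)
    string by dropping brackets; leaves are operator or predicate names."""
    if isinstance(tree, str):
        return tree
    op, *args = tree
    return " ".join([op] + [to_vf(a) for a in args])
```

Applied to the tree for EXIST(AND(SMALL,DOG),NOT(SWIM)), this reproduces the VF formula from the example above.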

B Training details
We implemented the GRU model and the Transformer model using PyTorch. Both models were optimized using Adam (Kingma and Ba, 2015) at an initial learning rate of 0.0005. The hyperparameters (batch size, learning rate, number of epochs, hidden units, and dropout probability) were tuned by random search. In all experiments, we trained models on eight NVIDIA DGX-1 Tesla V100 GPUs. The runtime for training each model was about 1-4 hours, depending on the size of the training set. The order of training instances was shuffled for each model. We used 10% of the training set for a validation set.

C Detailed evaluation results
Effect of Model Size The results we report are from a model with 10M parameters. How does the number of parameters affect the systematic generalization performance of the models? Table 11 shows the performance of three models of varying size (large: 27M, medium: 10M, small: 4M). The number of parameters did not have a large impact on generalization performance; all runs of the models achieved higher than 90% accuracy on the validation set and on the test set involving quantifiers of the same type as the primitive quantifier, while they did not work well on the test set involving the other quantifier types.

Modifier type Table 14 shows all evaluation results by modifier type where two or every is set as the primitive quantifier. Regardless of the primitive quantifier type, accuracies for problems involving logical connectives or adverbs were better than those for problems involving adjectives.

Monotonicity Table 15 shows all evaluation results of predicted FOL formulas and VF formulas based on monotonicity. We evaluate the precision, recall, and F-score for each monotonicity direction (upward and downward). Regardless of the meaning representation form, downward monotone accuracy on problems involving universal quantifiers is low. This indicates that both models struggle with learning the scope of universal quantifiers.

D Evaluation on systematicity of quantifiers and negation
We also analyze whether models can generalize to unseen combinations of quantifiers and negation.
Here, we generate Basic set 1 by setting an arbitrary quantifier as the primitive quantifier and combining it with negation. As in (15b), we fix the primitive quantifier to the existential quantifier one and generate the negated sentence One tiger did not run. Next, as in (16a) and (16b), the training set also contains non-negated sentences with the other quantifier types. We then test whether models can parse unseen combinations of these quantifiers and negation, like (17a) and (17b).

(15) a. One tiger ran
     b. One tiger did not run
(16) a. Every tiger ran
     b. Two tigers ran
(17) a. Every tiger did not run
     b. Two tigers did not run

Table 15: Evaluation results on monotonicity. "Prec", "Rec", and "F" indicate precision, recall, and F-score.

Table 16 shows the accuracy on combinations of quantifiers and negation by quantifier type. Similar to the results with unseen combinations of quantifiers and modifiers, models can easily generalize to problems involving quantifiers of the same type as the primitive quantifier. Table 17 shows the accuracy on combinations of quantifiers and negation by modifier type. Similar to the results in Table 14, the accuracies on problems involving logical connectives or adverbs were slightly better than those on problems involving adjectives.

E Error analysis of predicted DRSs
In the productivity experiments, the evaluation results using COUNTER on DRS prediction tasks are much higher than those by exact matching. Table 18 shows an example of predicted DRSs for the sentence all lions that did not follow two bears that chased three monkeys did not cry. This sentence contains embedded clauses with depth two, having the following gold DRS:

[x1 | lion(x1), ¬[x2, x3 | two(x2), bear(x2), three(x3), monkey(x3), chase(x2, x3), follow(x1, x2)]] ⇒ [ | ¬[ | cry(x1)]]

Both GRU and Transformer tend to correctly predict some of the clauses for content words, implication, and negation that appear at the beginning of the input sentence, while they fail to capture long-distance dependencies between subject nouns and verbs (e.g., all lions ... did not cry). Also, COUNTER correctly evaluates cases where the order of clauses differs from that of the gold answer.

Table 18: Error analysis of DRSs for the sentence "all lions that did not follow two bears that chased three monkeys did not cry". Clauses in green are correct and those in red are incorrect. "F" shows the F-score over matching clauses.