Not all quantifiers are equal: Probing transformer-based language models’ understanding of generalised quantifiers



Introduction
Generalised quantifiers have been a topic of much interest for more than a century in logic and linguistics (Frege, 1882; Westerståhl, 1987; Gabbay et al., 1989; Mostowski, 1957). By capturing the interplay between quantity and cardinality, they provide a useful lens through which to understand human language and cognition (Troiani et al., 2009; Szymanik and Zajenkowski, 2010a). Since transformer-based language models (TLMs) strive to simulate human-like language understanding (Vaswani et al., 2017; Devlin et al., 2018; Raffel et al., 2019; Ouyang et al., 2022; Chowdhery et al., 2022), it is essential to determine the extent to which they can comprehend generalised quantifiers. Assessing the depth of understanding that TLMs possess for any given concept is best achieved by evaluating their proficiency in applying it. In the case of generalised quantifiers, the most suitable evaluation task is textual entailment. This is particularly relevant because altering quantifiers can fundamentally change the logical inferences derived from a text, reinforcing the integral role that quantifiers play in the textual entailment task.
When discussing entailment, it is vital to acknowledge two distinct strands of research in the literature. The first incorporates background knowledge and common sense into entailment, imbuing it with a probabilistic character (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019). The second examines textual entailment in a purely logical sense, eliminating the influence of background knowledge and common sense (Richardson and Sabharwal, 2021; Schlegel et al., 2022; Madusanka et al., 2023). While the first form of entailment proves beneficial for a multitude of practical applications, it is not ideal for an investigation centred on the impact of linguistic properties with logical significance, such as generalised quantifiers and negation. The empirical evaluation of linguistic constructs under this kind of entailment is compromised by its intricate association with other concepts. Consequently, it is challenging to differentiate performance variation due to linguistic properties from that attributable to concepts like common sense and background knowledge. However, prior literature has only investigated generalised quantifiers in the context of entailment that incorporates background knowledge and common sense (Cui et al., 2022; Apidianaki and Garí Soler, 2021) and naturally suffers from the same predicament. The second strand of textual entailment, by defining entailment in a purely logical sense, circumvents the aforementioned shortcoming. Consequently, it offers a conducive environment for conducting evaluations centred around linguistic constructs.

Figure 1: An instance of the model-checking with natural language problem. The sentence "At least 3 musicians are guitarists" is True according to the structure, since the set of musicians X = {Roger, Solomon, Ava, Aria} are also guitarists and |X| ≥ 3. However, the sentence "All bee-keepers are scientists" is False, as the set of bee-keepers {Talia, Solomon} are not scientists.
The logical problem best suited to studying the influence of language constructs of logical significance is model-checking: given a formula ϕ and a structure A, determine whether ϕ is true in A (A |= ϕ). In the context of natural language, we are interested in a variant of the model-checking problem where the structure and the formula are translated into natural language. An instance of the model-checking with natural language problem is depicted in Figure 1. From a complexity-theoretic point of view, model-checking in most formal languages is, comparatively speaking, straightforward. Indeed, the model-checking problem with a fixed number of free variables and a finite structure is in PTIME. This is in contrast to other logical problems, such as satisfiability, whose variants for different fragments of logic can belong to different computational complexity classes (Pratt-Hartmann, 2004; Pratt-Hartmann and Third, 2006). Yet solving instances of the model-checking problem with natural language requires a comprehensive understanding of the logical semantics of the expressions involved. Thus, it provides an ideal test environment to faithfully evaluate the extent to which generalised quantifiers affect transformer-based language models.
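To make the problem concrete, the following minimal Python sketch (ours, not from the original paper) evaluates a universally quantified sentence against a finite structure by enumerating the domain; the structure and predicate names are borrowed from the Figure 1 example.

```python
# A finite structure: a domain of individuals and an interpretation mapping
# each unary predicate to its extension (the Figure 1 example, abridged).
structure = {
    "domain": {"Roger", "Solomon", "Ava", "Aria", "Talia"},
    "musician": {"Roger", "Solomon", "Ava", "Aria"},
    "guitarist": {"Roger", "Solomon", "Ava", "Aria"},
}

def holds_all(struct, antecedent, consequent):
    """Check A |= 'All antecedents are consequents', i.e. the first-order
    sentence  forall x (antecedent(x) -> consequent(x)),  by enumerating
    the finite domain -- time polynomial in the size of the structure."""
    return all(x not in struct[antecedent] or x in struct[consequent]
               for x in struct["domain"])

print(holds_all(structure, "musician", "guitarist"))  # True
```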
In this study, we embark on an in-depth investigation into TLMs' understanding of generalised quantifiers utilising the model-checking problem, juxtaposing this with cognitive science research on quantifier verification tasks (Szymanik and Zajenkowski, 2010a,b; McMillan et al., 2005). A critical part of our exploration involves the evaluation of pre-trained models prior to any fine-tuning, allowing us to discern whether any differences identified are intrinsic to the models themselves or introduced through the process of fine-tuning. Additionally, we consider the complexities arising from the integration of Boolean conjunctions and negation with generalised quantifiers. This aspect of our study sheds light on the intricate dynamics between these linguistic elements and the challenges they pose to TLMs. This comprehensive analysis paves the way for a more nuanced understanding of how TLMs handle intricate linguistic constructs such as generalised quantifiers.
The key contributions of the present research can be summarised as follows: (1) to the best of our knowledge, this study represents the first exploration of the effects of generalised quantifiers within a logical entailment context; (2) we analyse the effect on TLMs when quantifiers are paired with diverse logical constructs such as negation and Boolean conjunctions; (3) we compare and contrast the behaviour of TLMs with quantifiers against quantifier verification experiments conducted with human beings; and (4) we delve into how well TLMs comprehend generalised quantifiers in a zero-shot context, employing prompt engineering approaches such as chain-of-thought prompting (Wei et al., 2022b), and provide comparisons between pre-trained and fine-tuned models.

Related Work
Our work follows the literature on probing how different linguistic properties affect the behaviour of neural approaches such as transformer-based language models (Madusanka et al., 2023; Clark et al., 2021; Buijtelaar and Pezzelle, 2023; Jawahar et al., 2019; Ettinger, 2020). Specifically, our investigation is closely related to the literature whose linguistic properties of interest are generalised quantifiers (Cui et al., 2022; Apidianaki and Garí Soler, 2021). Our exploration differs from prior research in two key ways. First, we explore generalised quantifiers employing a task that is defined purely in a logical sense, thus providing a more faithful investigation into how TLMs comprehend generalised quantifiers. Second, our research integrates a comprehensive analysis of how the interaction of negations and Boolean conjunctions with quantifiers influences TLMs' performance in a simple entailment task. We follow the logical denotations introduced in logical studies to formalise generalised quantifiers when formulating our task (Westerståhl, 1987; Mostowski, 1957; Gabbay et al., 1989; Fuhrken, 1970; Peters and Westerståhl, 2006) and draw parallels with cognitive science work on quantifier verification in our experimental setup (Szymanik and Zajenkowski, 2010b; McMillan et al., 2005; Szymanik et al., 2016).

Table 1: The generalised quantifiers (GQ) we used in our experimental setup, along with their logical denotations defined on some structure A.
Our evaluation scheme for assessing TLMs in a zero-shot setting builds upon prior literature on prompt engineering (Brown et al., 2020; Kojima et al., 2022; Wei et al., 2022b). However, ours is the first work to evaluate TLMs on the model-checking problem in zero-shot settings.

Language Fragments and Generalised Quantifiers
We define a language fragment to be a set of sentence forms equipped with semantics translating those sentences into some formal system such as first-order logic (Pratt-Hartmann, 2004). Perhaps the simplest way to define a language fragment is via a finite set of sentence templates. A sentence template is a sentence in which certain open-class words have been replaced by schematic variables. For example, "All As are Bs" is a sentence template where A and B stand for ordinary nouns (e.g., artist, musician, beekeeper, ...); by substituting A and B with such nouns, we can formulate sentences such as "All musicians are artists". Due to the formal structure that exists in language fragments, a set of sentence templates is a natural way of representing them. For example, the Aristotelian syllogistic (Smith et al., 1989) can be defined using the following set of templates:

All As are Bs
Some As are Bs
No A is a B
Some As are not Bs

In this work, we employ a slightly extended version of the Aristotelian syllogistic, allowing negations at the subject (e.g., "Some non-musicians are beekeepers") and generalised quantifiers when generating sentences.
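As an illustration, instantiating a template from this extended fragment amounts to filling the schematic variables with sampled nouns and optionally negating the subject. The sketch below is ours; the template strings, noun vocabulary and p_neg value are made up for the example.

```python
import random

# Hypothetical template strings; {A} and {B} are the schematic variables.
TEMPLATES = {
    "All": "All {A}s are {B}s",
    "Some": "Some {A}s are {B}s",
}
NOUNS = ["artist", "musician", "beekeeper", "scientist"]

def instantiate(template, p_neg=0.5):
    # Sample two distinct ordinary nouns; with probability p_neg, negate the
    # subject noun (the extension of the syllogistic described above).
    a, b = random.sample(NOUNS, 2)
    if random.random() < p_neg:
        a = "non-" + a
    return template.format(A=a, B=b)

print(instantiate(TEMPLATES["Some"]))  # e.g. "Some non-musicians are beekeepers"
```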
Generalised quantifiers define the semantics of sentences that include them in terms of relations between subsets of the structure (Szymanik et al., 2013). Consider, for example, "All musicians are artists". The determiner phrase "All" in this sentence specifies a relation between the set of musicians and the set of artists, namely that the former is a subset of the latter. More generally, "All" in a structure A expresses the binary quantifier

All^A = { (X, Y) : X, Y ⊆ A and X ⊆ Y }.

This idea can be generalised to accommodate other quantifiers. Consider the sentence "At least K musicians are artists", where K ∈ N. The phrase "At least K" likewise expresses a relation between the set of musicians and the set of artists, namely that the cardinality of their intersection is at least K; that is, "At least K" in A expresses the binary quantifier

At least K^A = { (X, Y) : X, Y ⊆ A and |X ∩ Y| ≥ K }.

In this study, we examine logical quantifiers such as "All", numerical quantifiers such as "At least K" and proportional quantifiers such as "Most". The quantifiers employed and their logical denotations on a structure A are depicted in Table 1. We utilise these generalised quantifiers when defining language fragments for sentence generation. Let T_Q be the sentence template which defines the language fragment corresponding to the quantifier Q. For example, for the quantifier "All", the corresponding template T_All takes the form "All (non-)As are (not) Bs", where A and B are replaced by ordinary nouns. Appendix A depicts the sentence templates used to define the language fragments for each of the quantifiers.
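These set-theoretic denotations translate directly into executable checks. The sketch below (ours) encodes standard readings of the quantifiers as relations between the extensions X and Y; the readings of "Most", "Few" and "K" are assumptions matching common usage rather than a verbatim copy of Table 1.

```python
# Standard set-relation readings for the quantifiers; X and Y are the
# extensions of the two nouns in a sentence "Q As are Bs".
QUANTIFIERS = {
    "All":         lambda X, Y, K=None: X <= Y,
    "Some":        lambda X, Y, K=None: bool(X & Y),
    "At least K":  lambda X, Y, K=3: len(X & Y) >= K,
    "At most K":   lambda X, Y, K=3: len(X & Y) <= K,
    "More than K": lambda X, Y, K=3: len(X & Y) > K,
    "Less than K": lambda X, Y, K=3: len(X & Y) < K,
    "K":           lambda X, Y, K=3: len(X & Y) == K,             # exactly K (assumed)
    "Most":        lambda X, Y, K=None: len(X & Y) > len(X - Y),
    "Few":         lambda X, Y, K=None: len(X & Y) < len(X - Y),  # assumed reading
}

musicians = {"Roger", "Solomon", "Ava", "Aria"}
guitarists = {"Roger", "Solomon", "Ava", "Aria"}
print(QUANTIFIERS["At least K"](musicians, guitarists, K=3))  # True
```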

Data Construction
Algorithm 1: Data Construction - Model Checking with Generalised Quantifiers

Input: the quantifier Q and corresponding sentence template T_Q, a natural language template M to convert the structure to natural language, the vocabulary of proper nouns D and ordinary nouns P, minimum and maximum numbers of domain elements d_min and d_max, minimum and maximum numbers of predicates p_min and p_max
Output: model checking dataset D

1: D ← ∅
2: repeat
3:   D, P ← sample from vocabularies D and P such that d_min ≤ |D| ≤ d_max and p_min ≤ |P| ≤ p_max
4:   A, B ← sample two predicates from P
5:   A, B ← negate A, B with probability p_neg
6:   s ← substitute predicates A and B for schematic variables in the template T_Q
7:   ℓ ← sample from {True, False}
8:   repeat
9:     A ← generate structure randomly using (D, P)
10:    ℓ' ← ModelCheck(A, s)
11:  until ℓ' = ℓ
12:  M ← translate A to natural language using the template M
13:  D ← D ∪ {(M, s, ℓ)}
14: until stop condition is met

We develop a data construction algorithm (Algorithm 1) to construct a balanced dataset free from easily exploitable trivial linguistic patterns. The algorithm constructs a set of triplets (M, s, ℓ), where M is the natural language translation of the structure, s is a sentence of the relevant fragment T_Q and ℓ is a label (True/False) specifying whether s is true in M. To construct (M, s, ℓ), apart from T_Q, the algorithm also takes the vocabularies D and P as inputs. The vocabulary D comprises proper nouns employed to characterise domain elements, while the vocabulary P comprises ordinary nouns that characterise predicates. We draw random samples of elements D and P from the vocabularies to construct the structure. Two random nouns are sampled from P, each is then negated with probability p_neg, and these are finally substituted for the two schematic variables in the template T_Q to form the sentence s.
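A compact Python rendering of Algorithm 1 follows. This is our sketch: predicate negation and the natural language template M are simplified, and the structure generator assigns each predicate to each element with a uniform coin flip.

```python
import random

def extensions(struct, pred):
    """Extension of a unary predicate in a structure mapping each domain
    element to the set of predicates it satisfies."""
    return {d for d, preds in struct.items() if pred in preds}

def make_instance(template, quantifier, proper_nouns, ordinary_nouns,
                  d_range=(8, 14), p_range=(5, 10)):
    """One iteration of the outer loop of Algorithm 1: fix a sentence and a
    target label, then rejection-sample random structures until the
    model-checker agrees -- keeping the dataset balanced by construction."""
    D = random.sample(proper_nouns, random.randint(*d_range))
    P = random.sample(ordinary_nouns, random.randint(*p_range))
    A, B = random.sample(P, 2)
    s = template.format(A=A, B=B)
    label = random.choice([True, False])
    while True:
        struct = {d: {p for p in P if random.random() < 0.5} for d in D}
        if quantifier(extensions(struct, A), extensions(struct, B)) == label:
            break
    M = ". ".join(f"{d} is a {p}" for d in D for p in struct[d])
    return M, s, label

M, s, label = make_instance("All {A}s are {B}s", lambda X, Y: X <= Y,
                            ["Roger", "Solomon", "Ava", "Aria", "Talia",
                             "Tony", "Hailee", "Noah"],
                            ["artist", "musician", "beekeeper", "scientist"],
                            d_range=(3, 6), p_range=(2, 4))
print(s, label)
```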
Given (possibly negated) words A, B, a generalised quantifier Q and a structure A, the model-checker determines A |= s, where s is the sentence formed by substituting A, B for the schematic variables in the template T_Q. This involves first determining the extensions of A and B in A and then applying the meaning of the generalised quantifier Q to these sets. Consider the example put forth in Section 1: the (Q, A, B) corresponding to the sentence "All bee-keepers are scientists" is (All, bee-keeper, scientist). The model-checker determines the extensions of bee-keepers and scientists in the corresponding structure A to be {Talia, Ava, Solomon} and {Hailee, Ava, Tony}, respectively. The quantifier "All" dictates that for the sentence to be True, the former must be a subset of the latter. However, {Talia, Ava, Solomon} ⊄ {Hailee, Ava, Tony}; thus, the model-checker assigns False as the validity label ℓ.
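The worked example, executed directly (extensions copied from the text above):

```python
# Extensions determined by the model-checker for "All bee-keepers are
# scientists" in the corresponding structure.
beekeepers = {"Talia", "Ava", "Solomon"}
scientists = {"Hailee", "Ava", "Tony"}

# "All" holds iff the first extension is a subset of the second.
print(beekeepers <= scientists)  # False: Talia and Solomon are not scientists
```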
This setup can be extended with relative ease when Boolean conjunctions are introduced into the sentences. Consider, for instance, a sentence pair s_1 and s_2, formed using the predicate pairs (A_1, B_1) and (A_2, B_2), respectively, for some quantifier Q, merged using a Boolean conjunction ⊙ ∈ {∧, ∨}. To adapt to this scenario, the algorithm can be augmented with a simple modification to step 10: the model-checker evaluates the combined sentence, determining the truth value of s_1 ⊙ s_2 in A.
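Concretely, the adjusted check might look as follows (our sketch; it evaluates the two conjuncts separately and combines them with the sampled Boolean operator):

```python
import operator

def extension(struct, pred):
    return {d for d, preds in struct.items() if pred in preds}

def check_conjunction(struct, quantifier, pair1, pair2, op):
    """Truth value of 's1 (.) s2', where sentence s_i is built from the
    predicate pair pair_i with the given quantifier, and op is
    operator.and_ (AND) or operator.or_ (OR)."""
    v1 = quantifier(extension(struct, pair1[0]), extension(struct, pair1[1]))
    v2 = quantifier(extension(struct, pair2[0]), extension(struct, pair2[1]))
    return op(v1, v2)

struct = {"Ava": {"artist", "musician"}, "Tony": {"scientist"}}
print(check_conjunction(struct, lambda X, Y: bool(X & Y),   # "Some"
                        ("artist", "musician"), ("artist", "scientist"),
                        operator.or_))  # True
```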

Prompts for Zero-shot Evaluation
Given that transformer-based language models undergo pre-training through some form of language modelling objective, the most common approach to evaluating these models in the zero-shot setting is prompt engineering (Brown et al., 2020). Consequently, we formulate prompts following a template-based strategy, utilising the constructed tuples (M, s, ℓ). We adopt two distinct types of templates. The first adheres to a more traditional form of prompting, which we refer to as standard prompting. The second is based on the concept of chain-of-thought prompting (Wei et al., 2022b). Chain-of-thought prompting is a technique in which an example problem instance, accompanied by an explanation of the underlying thought process, is used to guide the model towards generating more precise responses. We depict the exact templates in Appendix A.
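A sketch of how such prompts can be assembled from the (M, s, ℓ) tuples is shown below (ours; the wording follows the templates in Appendix A, and the commented-out Flan-T5 call is illustrative rather than the exact evaluation harness).

```python
def standard_prompt(M, s):
    # Standard prompting template (cf. Appendix A.2).
    return (f"Q: Given the following scenario, {M}. "
            f"Is the sentence {s} True or False according to the scenario?\nA:")

def cot_prompt(M, s, M_e, s_e, l_e, E):
    # Chain-of-thought template: prepend one solved example (M_e, s_e, l_e)
    # together with an explanation E of the reasoning (Wei et al., 2022b).
    return standard_prompt(M_e, s_e) + f" {l_e}. {E}\n" + standard_prompt(M, s)

# Illustrative zero-shot call, e.g. with a small Flan-T5 checkpoint:
# from transformers import pipeline
# pipe = pipeline("text2text-generation", model="google/flan-t5-base")
# print(pipe(standard_prompt(M, s))[0]["generated_text"])
```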

Transformer-based Language Models
To explore transformer-based language models' ability to comprehend different generalised quantifiers, we employ a set of TLMs with a proven track record on textual entailment problems, namely T5, Flan-T5, DeBERTa, LLaMA and ChatGPT.
T5 Following prior work on textual entailment defined purely in a logical sense (Richardson and Sabharwal, 2021; Tafjord et al., 2021; Madusanka et al., 2023), we utilise the T5 model as one of the baseline models in our experimental setup. The T5 model (Raffel et al., 2019) employs a unified text-to-text format where all inputs and outputs are textual strings. We fine-tune the T5-large model with 770M parameters to perform the model-checking task.
Flan-T5 Fine-tuned Language Net (Chung et al., 2022), also known as Flan, is based on instruction fine-tuning (Wei et al., 2022a) with the objective of making the transformer model generalise better to unseen tasks. The Flan-T5 model, considered an improvement over T5, applies instruction fine-tuning to the T5 model family. Thus, we primarily centre our experimental setup around the Flan-T5 model. We fine-tune the Flan-T5-large model with 770M parameters, and utilise Flan-T5-base with 220M parameters, Flan-T5-large, Flan-T5-xl with 3B parameters and Flan-T5-xxl with 11B parameters in the zero-shot setting.

DeBERTa-v3
Due to the recent success of the DeBERTa-v3 model (He et al., 2021) in solving natural language inference tasks, we utilise it as a baseline model. The DeBERTa architecture improves upon the BERT and RoBERTa models using a disentangled attention mechanism and an enhanced mask decoder. DeBERTa-v3 further improves the architecture by utilising ELECTRA-style pre-training with Gradient-Disentangled Embedding Sharing. We fine-tune the DeBERTa-v3-large model with around 304M parameters.
ChatGPT Due to the recent success of ChatGPT in solving many natural language tasks in a zero-shot setting (Bang et al., 2023), we employ it in a similar context. Similar to InstructGPT (Ouyang et al., 2022), ChatGPT is trained to follow human instructions but follows a slightly different data collection approach.
LLaMA Considering that Flan-T5 and ChatGPT are trained to follow instructions, we decided to use a TLM which has not been explicitly trained to follow instructions as one of our baselines. Thus, we employ the LLaMA-30B model in zero-shot settings. LLaMA is reported to outperform GPT-3 on most benchmarks and to achieve performance comparable to state-of-the-art TLMs (Touvron et al., 2023).

Dataset and Evaluation
To fine-tune and evaluate TLMs, we construct train and test sets with 72K and 36K unique problem instances, i.e., 8K and 4K data points for each generalised quantifier. We arbitrarily select [d_min, d_max] = [8, 14] and [p_min, p_max] = [5, 10] when constructing problem instances. Since we construct a balanced dataset, we use accuracy as the main metric, but we also report overall precision, recall and F1 score to provide a more detailed analysis. We deem this setup answers the question, "How do different quantifiers affect the behaviour of TLMs?". As the problem instances contain negations, our experiments also provide insight into the effect of negation, when intertwined with quantifiers, on TLMs' understanding of language. We construct separate train and test sets with 72K and 36K problem instances whose sentences contain Boolean conjunctions to answer the question, "How do Boolean conjunctions affect the behaviour of TLMs when coupled with different quantifiers?". By evaluating the fine-tuned models against problem instances with higher K values in the numerical quantifiers than those of the train set, we ask the question, "Do TLMs learn to comprehend the logical semantics of generalised quantifiers?". To answer the questions "How do pre-trained TLMs comprehend different quantifiers?" and "Do they have any biases when performing a simple entailment task?", we evaluate TLMs in a zero-shot setting. We found that problem instances with [d_min, d_max] = [8, 14] and [p_min, p_max] = [5, 10] are too challenging for TLMs in zero-shot settings; consequently, using the same test set did not yield any meaningful insights. Thus, we formulate much simpler problem instances with [d_min, d_max] = [3, 6] and [p_min, p_max] = [2, 4]. A more detailed description of the dataset and fine-tuning is provided in Appendix B.

Results and Discussion
The ability of transformer-based language models to solve instances of the model-checking problem is differentially influenced by various generalised quantifiers. As demonstrated in Table 2, TLMs appear to encounter the most difficulty with proportional quantifiers such as "Most" and "Few". Interestingly, this empirical observation aligns with cognitive science research, which also highlights the complexities faced by humans in interpreting proportional quantifiers (Szymanik and Zajenkowski, 2010a; McMillan et al., 2005; Troiani et al., 2009). In addition, the performance for the quantifier "K" is notably lower in comparison to other numerical quantifiers. A sentence incorporating the quantifier "K" is probabilistically more likely to be False given a random structure. Therefore, in a balanced dataset, determining the truth value of a sentence containing the quantifier "K" necessitates a more detailed examination than for the other numerical quantifiers considered in this study. However, as illustrated in Figure 2, given an adequate number of training steps, all TLMs attain satisfactory performance levels across all quantifiers. Moreover, the newer TLMs, such as DeBERTa-v3 and Flan-T5, exhibit a faster convergence rate than T5.
As expressed by their precision and recall values, TLMs often predict True for quantifiers such as "All" and "Less than K", but often predict False for quantifiers like "Some" and "More than K". We attribute this to an overcorrection introduced during fine-tuning. Consider a sentence with the quantifier "Less than K": "Less than K artists are engineers". This sentence is more likely to be False in the context of the real world for the K values we consider in this study (8 ≤ K ≤ 14), since there are more than 14 artists who are engineers in the world. This proposition remains true even when negations are introduced into the sentences. Thus, we speculate that TLMs overcorrect during fine-tuning and predict True or False accordingly.

TLMs show evidence of learning to understand the logical semantics of generalised quantifiers. As illustrated in Figure 3, when tested with a dataset containing higher K values than those of the train set, the accuracy of TLMs decreases only slightly for all numerical quantifiers. Therefore, we posit that TLMs possess the capacity to learn the logical semantics associated with generalised quantifiers. Our conclusions regarding generalisation bear resemblance to prior work on model-checking with natural language (Madusanka et al., 2023), whose research also supports the premise that TLMs are capable of comprehending the logical semantics of natural language. Additionally, we highlight the contrast between the demonstrated ability of TLMs to generalise in the context of model-checking problems and their apparent lack of such generalisation when solving satisfiability problems (Schlegel et al., 2022; Richardson and Sabharwal, 2021). We hypothesise that this distinction is due to the different complexity levels associated with these two types of problems and the necessity to understand complex inference rules when solving satisfiability problems.
Boolean conjunctions have a significant effect, while negation has much less effect, on fine-tuned TLMs when coupled with generalised quantifiers. As demonstrated in Table 3, it is apparent that fine-tuned TLMs possess the capacity to comprehend negations. In contrast, as depicted in Figure 4 (c), prior to the fine-tuning process, negations exert a considerable influence. Consequently, we propose that fine-tuning plays a significant role in enhancing the ability of TLMs to understand negations. The inclusion of Boolean conjunctions significantly reduces the accuracy for all quantifiers. Noticeably, quantifiers for which TLMs tend to predict True often have higher accuracy for the OR operation than for AND, and vice versa. We found that quantifiers for which TLMs frequently predict the label True also display elevated recall values for OR operations and diminished recall for AND operations. Conversely, quantifiers that TLMs often predict as False exhibit higher precision for AND operations and lower precision for OR operations.

Table 3: The test accuracy values for the Flan-T5-large model across various generalised quantifiers, broken down based on the Boolean conjunction (AND, OR) in the sentence. The abbreviations ac, pr, re and f1 denote accuracy, precision, recall and F1 score values.
The number of parameters, training process and type of prompting can influence TLMs' performance when solving model-checking problem instances in zero-shot settings. The performance of the Flan-T5 models exhibits a power-law relationship with the number of parameters, as illustrated in Figure 4 (a). This empirical finding is consistent with prior research analysing how language models' performance varies with factors such as the number of parameters, dataset size and computational resources (Kaplan et al., 2020). However, upon breaking down the performance metrics by quantifier, the resulting graph (Figure 4 (b)) is less uniform than the representation of overall performance. We attribute this behaviour to the inherent probabilistic aspect of the predictions formulated by TLMs, since language models are trained to find the most probable next word given a set of words. This probabilistic nature can lead to inaccurate predictions, especially in a logical context.
The Flan-T5 model, with fewer parameters, outperformed the ChatGPT and LLaMA models in a zero-shot setting, as depicted in Table 5. This phenomenon is unsurprising, since Flan-based models are very effective in tasks naturally verbalised as instructions due to their employment of instruction fine-tuning (Wei et al., 2022a). Upon contrasting the efficiency of the ChatGPT and Flan-T5 models under standard and chain-of-thought prompting techniques, we observe that the discrepancy in accuracy between the two prompting methodologies is not substantial. However, the LLaMA model generated both True and False in its output when standard prompting was used, failing to follow the instruction properly. We attribute this failure of the LLaMA model to its training process, which, unlike that of the other two models, does not involve instruction following. When subjected to the chain-of-thought prompting approach, the LLaMA model displayed more consistency, generating either a True or a False label. Thus, we infer that the inclusion of examples assists the LLaMA model in generating more concise outputs.

Table 4: The test accuracy values for the Flan-T5-large model across various generalised quantifiers, broken down based on the negations in the sentence. Here s and n denote the subject and the predicate nominative, and 1 and 0 denote having or not having a negation at s or n. For example, s_0 n_1 denotes no negation at the subject and a negation at the predicate nominative.
Accuracy values for TLMs in zero-shot settings vary drastically across quantifiers. As depicted in Table 5, TLMs struggle with numerical quantifiers whose cardinality of intersection has an upper bound, such as "At most K" and "Less than K". This diminished performance can be attributed primarily to the TLMs' tendency to predict the label False more frequently for sentences incorporating these specific quantifiers. We hypothesise this phenomenon is due to two factors. First, the background knowledge already embedded in TLMs from pre-training: as indicated previously, sentences containing the above quantifiers coupled with a low K value are often False in a real-world scenario. Second, prior cognitive research on quantifiers has demonstrated that downward-monotone quantifiers, such as "Few" or "Less than K", present more processing challenges for humans than upward-monotone quantifiers, such as "Most" and "More than K" (Geurts and van der Slik, 2005; Zeijlstra, 2020; Agmon et al., 2019). Since TLMs are trained on human-generated data, it is highly likely that these models have incorporated this cognitive trait into their understanding of language, which, in turn, affects their responses. Moreover, a deeper analysis of the answers generated through chain-of-thought prompting revealed that even when the predicted label is correct, the overall answer is often incoherent. This coherence deficit, coupled with the difficulties in handling certain quantifiers, suggests that these models are yet to achieve proficiency in learning even the simplest inferential rules.

Conclusion
We investigated how generalised quantifiers affect the behaviour of transformer-based language models by employing the problem of model-checking. We found that different generalised quantifiers affect TLMs in varying ways when solving model-checking problems in both fine-tuned and zero-shot settings. Based on empirical findings on generalisation, we posited that TLMs can learn to understand the logical semantics of generalised quantifiers. Moreover, our experimental setup in the zero-shot setting demonstrated that a multitude of factors, such as the training process, the size of the models and the type of prompts, can affect the ability of TLMs to solve a simple entailment task. Thus, a compelling avenue for future research is to probe how these factors affect transformer-based language models when solving a more complex entailment task, such as determining satisfiability.

Limitations
Due to the empirical nature of this study, it suffers from an inductive dilemma on three fronts: the first concerns the transformer architectures, the second the generalised quantifiers, and the third the prompts we explored in zero-shot settings. We explored several transformer-based language models in line with prior literature and probed how different generalised quantifiers affect their behaviour. Nonetheless, due to the empirical nature of this investigation, it is plausible that some TLM architectures could deviate from the behavioural norms discussed in this paper when interacting with generalised quantifiers. A similar limitation applies to the range of generalised quantifiers examined, as those employed in our study do not represent the entire spectrum of generalised quantifiers. In zero-shot settings, this limitation further extends to the prompt templates we employed: we consider two types of prompt templates, but there is a multitude of alternative ways prompts can be formulated using the (M, s, ℓ) triplets.

A Appendix: Templates
A.1 Sentence Templates

When constructing sentences, as mentioned in the methodology section, we employ sentence templates. Let Q be a generalised quantifier and T_Q be the sentence template for the corresponding quantifier Q. Then T_Q takes the general form

Q (non-)As are (not) Bs

where the inclusion of "non"/"not" is determined by the availability of the negations, and A and B are ordinary nouns. Consider the quantifier "At most K" for some natural number K; the corresponding sentence template takes the form "At most K (non-)As are (not) Bs". Table 6 depicts the sentence templates corresponding to the quantifiers considered in this study.

A.2 Prompt templates for Zero-shot settings
We employ the tuples (M, s, ℓ) to delineate prompts for the language modelling objective, providing a framework for evaluating the effectiveness of TLMs in zero-shot settings. As mentioned in the methodology section, we explored two types of prompts: one we informally call standard prompting, and the other is based on chain-of-thought prompting. Standard prompting is conceptualised by the following template:

Q: Given the following scenario, M. Is the sentence s True or False according to the scenario?
A:

Chain-of-thought prompting employs an example problem instance with an explanation of the thought process, thereby facilitating a more precise response from TLMs. If we let (M_e, s_e, ℓ_e) represent this example problem instance and E elucidate the thought process, chain-of-thought prompting can then be defined using the template:

Q: Given the following scenario, M_e. Is the sentence s_e True or False according to the scenario?
A: ℓ_e. E
Q: Given the following scenario, M. Is the sentence s True or False according to the scenario?
A:

B Appendix: Dataset and Fine-tuning

B.1 Dataset Details

The number of tokens in a problem instance grows with |D|, where |D| denotes the number of domain elements selected when formulating the problem instance. The minimum, maximum and mean numbers of tokens for problem instances of each quantifier are depicted in Table 7. To evaluate TLMs' behaviour with Boolean conjunctions, we also constructed separate train and test sets with 72K and 36K data points. The dataset contains an equal number of problem instances for each conjunction and quantifier pair. Moreover, since the intention was to compare the effects of generalised quantifiers on transformer-based language models, we decided to use the simplest form of language templates, i.e., the syllogistic.

We also emphasise the rationale behind the iterative approach we used in constructing the data. An alternative way of constructing problem instances is to derive the label ℓ using the model-checker instead of iteratively creating structures to match a pre-defined label and sentence. However, this alternative approach can induce easily exploitable patterns. Consider the quantifiers "K", "All" and "Some": for a random structure, sentences with the quantifiers "K" and "All" are more likely to be False, while sentences with the quantifier "Some" are probabilistically more likely to be True.
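A quick Monte-Carlo check makes this skew explicit. The sketch below is ours; the 10-element domain and the probability 0.5 of holding each predicate are assumptions for the illustration.

```python
import random

def random_structure(n=10, preds=("A", "B"), p=0.5):
    return {i: {q for q in preds if random.random() < p} for i in range(n)}

def estimate_p_true(relation, trials=10_000):
    """Estimate how often a quantified sentence is True in a uniformly random
    structure -- the label skew that deriving labels with the model-checker
    (instead of rejection sampling) would bake into the dataset."""
    hits = 0
    for _ in range(trials):
        s = random_structure()
        X = {d for d, ps in s.items() if "A" in ps}
        Y = {d for d, ps in s.items() if "B" in ps}
        hits += relation(X, Y)
    return hits / trials

print(estimate_p_true(lambda X, Y: X <= Y))       # "All": mostly False
print(estimate_p_true(lambda X, Y: bool(X & Y)))  # "Some": mostly True
```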

B.2 Fine-tuning Details
Formally, we define the task as a binary classification problem where the objective of the transformer-based language model is to predict the label ℓ (True or False) given the natural language interpretation of the structure M and the sentence s as inputs. We select and fine-tune three TLMs, namely T5, Flan-T5 and DeBERTa-v3, all of which have previously demonstrated their efficiency and reliability in resolving textual entailment tasks. According to prior literature, the performance of TLMs depends mostly on the pre-training data and the size of the models rather than on the architectural choice (Raffel et al., 2019; Kaplan et al., 2020). Moreover, the accuracy values yielded by all TLMs are similar. Thus, we expect similar behaviour from other TLM architectures as well. Since the TLMs achieve satisfactory accuracy, and since the central research interest is to analyse the behaviour of TLMs rather than to identify the best-performing TLM, we do not perform any hyperparameter tuning. Moreover, exploring several different TLMs and performing hyperparameter tuning leaves a higher carbon footprint (Strubell et al., 2019). Given the nature of the research question, we consider such an exploration unnecessary.

Loss function and optimizer
We fine-tune each TLM to predict the label ℓ given (M, s) by reducing the binary cross-entropy loss over the target using the Adam (Kingma and Ba, 2015) optimizer.
Batch size Utilising gradient checkpointing for memory-efficient fine-tuning, we set the batch size to 36.

Number of epochs
We fine-tune each TLM for 4 epochs, resulting in 8,000 steps.

Maximum token length We set the maximum token length to 512; since the maximum problem length is much lower than that, we do not truncate the inputs.

Learning rate We set the learning rate to 1 × 10^-5.

We utilise the Huggingface (Wolf et al., 2019) implementations when experimenting with the TLM models we consider in this study.
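For reference, a minimal sketch of this setup with the Huggingface API (ours; the dataloader and the DeBERTa classification variant are omitted, and the target is verbalised as "True"/"False" for the text-to-text models):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model.gradient_checkpointing_enable()           # memory-efficient fine-tuning
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def training_step(M, s, label):
    """One fine-tuning step on an (M, s, label) triplet: the seq2seq
    cross-entropy over the short verbalised target reduces to a binary
    objective, since the target is either 'True' or 'False'."""
    inputs = tokenizer(f"{M} {s}", return_tensors="pt",
                       truncation=True, max_length=512)
    targets = tokenizer("True" if label else "False", return_tensors="pt")
    loss = model(**inputs, labels=targets.input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```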

Figure 2: The rate of convergence of (a) Flan-T5-large, (b) DeBERTa-v3-large and (c) T5-large models, broken down based on different quantifiers.

Figure 3: The accuracy values when tested against problem instances with sentences containing higher K values than those of the train set. The results are broken down based on the numerical quantifier.

Figure 4: The (a) overall test loss, (b) test loss broken down based on quantifier, and (c) accuracy values broken down based on the availability of negations, for Flan-T5 models with different numbers of parameters in zero-shot settings; the number of parameters varies from 220M to 11B.

Table 2: The test scores for the Flan-T5-large model across various generalised quantifiers. The abbreviations ac, pr, re and f1 denote accuracy, precision, recall and F1 score values.

Table 5: The test accuracy values for the ChatGPT, Flan-T5-xxl and LLaMA-30B models in zero-shot settings; st denotes the standard prompting approach, while ch denotes the chain-of-thought prompting approach.