Not Wacky vs. Definitely Wacky: A Study of Scalar Adverbs in Pretrained Language Models

Vector-space models of word meaning all assume that words occurring in similar contexts have similar meanings. Words that are similar in their topical associations but differ in their logical force tend to emerge as semantically close – creating well-known challenges for NLP applications that involve logical reasoning. Pretrained language models such as BERT, RoBERTa, GPT-2, and GPT-3 hold the promise of performing better on logical tasks than classic static word embeddings. However, reports are mixed about their success. Here, we advance this discussion through a systematic study of scalar adverbs, an under-explored class of words with strong logical force. Using three different tasks involving both naturalistic social media data and constructed examples, we investigate the extent to which BERT, RoBERTa, GPT-2 and GPT-3 exhibit knowledge of these common words. We ask: 1) Do the models distinguish amongst the three semantic categories of MODALITY, FREQUENCY and DEGREE? 2) Do they have implicit representations of full scales from maximally negative to maximally positive? 3) How do word frequency and contextual factors impact model performance? We find that despite capturing some aspects of logical meaning, the models still have obvious shortfalls.


Introduction
Large pretrained language models such as BERT (Devlin et al., 2018) are trained using tasks which rely on the assumption that the meanings of words are revealed by the company they keep (Harris, 1954). In particular, Masked Language Modelling (MLM) is useful because information about the identity of the masked word is available from the words in the context. This assumption is highly appropriate for nouns and named entities. Different topics of discussion involve different entities, and in discussing any given topic, the associated entities will be referred to repeatedly. However, the assumption holds less well for other classes of words, which can be used in discussing virtually any topic. These include quantifiers (e.g., few, many, all), words expressing negation (e.g., not, no, never) and the focus of the present paper, scalar adverbs (e.g., slightly, very, completely). Many scalar adverbs tend to occur in similar contexts but have distinct meanings. As Abrusan et al. (2018) put it, distributional models tend to be 'blind' to logical meanings because the latter are topic-independent, and thus their meanings tend not to be reflected in their distributional contexts. For example, the meanings of slightly, very and completely can be defined in terms of positioning on the scale of the adjective they modify, but this scale could be TEMPERATURE in discussing the weather and ACCURACY in discussing scientific advances.
Scalar adverbs are extremely common, and accurate representation and processing of their meanings is pertinent to a wide variety of NLP applications, including sentiment analysis (De Melo and Bansal, 2013; Ruppenhofer et al., 2014), entailment inferences (McNally, 2017), detecting contradictions, and indirect Question Answering (De Marneffe et al., 2010). BERT-family models succeed without fine-tuning on a remarkable variety of tasks, including filling semantic roles (Ettinger, 2020), determining entity types and relations (Tenney et al., 2019), resolving coreference (Kovaleva et al., 2019) and even open-domain question answering of knowledge base facts (Petroni et al., 2019). In a study that is strongly related to ours, unsupervised embeddings from BERT base have been used fairly successfully to rank gradable adjectives according to their scalar position on half-scales from a neutral to an extreme value (e.g., warm < hot < scalding) (Soler and Apidianaki, 2020). This result suggests that the BERT architecture might also be successful in exploiting diffuse or indirect cues about the ranking of scalar adverbs. On the other hand, the same models have had little success in representing negation (Ettinger, 2020) or antonymy (Talmor et al., 2020). In order to thoroughly explore the coding of scalar adverbs, we look at full scales and directly compare scalar adverbs to explicit negation.

Category: Adverbs
MODALITY (14.8%): {maybe, perhaps, possibly}, arguably, probably, actually, certainly, definitely
FREQUENCY (5.3%): never, occasionally, sometimes, often, generally, usually, frequently, always
DEGREE (46.8%): hardly, slightly, basically, pretty, quite, very, really, completely

Table 1: Scalar adverbs in each semantic category, ranked by scalar position using semantic theory and WordNet definitions. Bracketed items are tied. Percentages for each category are the overall percentage of that category in the Reddit slice, in relation to the set containing the target adverbs plus not.

We consider the 24 adverbs in Table 1, selected through the process described in Section 3. These vary widely in frequency, with very found 22818 times in the Reddit slice and frequently found only 52 times. At 40986 occurrences, not is more frequent than any scalar adverb (see the Appendix for full adverb frequencies).

Background
We begin by introducing some concepts from linguistic semantics and pragmatics that motivate our study. In NLP, the workhorse of document retrieval is the fact that the topic under discussion hugely influences what entities people refer to. According to semantic theory, individual unique entities in the real world are referred to by proper nouns; common nouns refer to sets of entities. The bursts in uses of proper and common nouns associated with the current topic, and the lulls in uses of the same words when the topic changes, provide the basis for the distributional hypothesis about word meanings.
In contrast to nouns, many other types of words have more complicated semantic structures. Partee (1992) develops a typology of word types according to what implicit variables they contain. In Altmann et al. (2009), this typology is simplified and applied to explain why some types of words are typically much less bursty than nouns. Of particular relevance here is the distinction between entities and operators. Operators are words like tall, quite, supposedly that have hidden variables supplied by the context. The many different ways of filling in these hidden variables mean that they can be used in many different contexts. As a corollary, Altmann et al. (2009) demonstrate that they are much less bursty than words referring to entities.
A class of such words that has recently attracted considerable attention is gradable adjectives such as cold or tall. These adjectives position the expression they modify on a scale. By using them, the speaker indicates that the modified expression has a position on the scale exceeding a given threshold, which can be determined by combining Bayesian inference about speaker informativity with prior domain knowledge (Lassiter and Goodman, 2013). By hearing water described as cold, or someone described as tall, a pragmatic listener will combine their prior knowledge about water temperatures, or people's heights, with an assumption that the speaker has a reason for using a modifier: the temperature or height being described is distant enough from what might be expected to warrant the cost of an extra word. This means that the formal semantic representation of tall contains a hidden variable, representing a threshold whose value is contextually determined.
The adverbs in our study themselves modify scalar adjectives, introducing a further level of abstraction into their semantic representations. Adverbs of DEGREE act by simply moving the degree threshold of the original adjective (Bennett and Goodman, 2018); i.e., very cold water has an inferred range of temperatures on average lower than cold water. Adverbs of frequency, on the other hand, do not act on the (continuous) intensity of a single event, but rather describe a point on a scale of discrete, bounded occurrences of the relevant property (Doetjes, 2007). Lastly, modal or epistemic adverbs do not modify the adjectival property itself, but instead express the speaker's evaluation of the likelihood of the property (Lang, 1979). Because they pertain to different scales, these categories can be freely combined without engendering contradictions: e.g., Mary can be often slightly angry, certainly slightly angry, occasionally very angry, or sometimes definitely angry.
Altmann et al. (2009) show that modifiers, such as tall, tend to be less bursty than entities and common nouns, and that higher-level operators, such as certainly, tend to be even less bursty than modifiers. These observations raise questions about the applicability of the distributional hypothesis to operators. If two operators that contrast in their logical force, such as always and never, are equally available in a wide variety of contexts, it follows that the particular context might provide little information about which one the speaker actually selected. This conjecture receives further support from Röttger and Pierrehumbert (2021). They study which types of words are responsible for the improvement from temporal adaptation of BERT for MLM of social media data. They find that named entities and common nouns arising as the topic of discussion changes over time provide most of the benefit. Adverbs and adjectives contribute little.
In addition to MLM, our study explores entailment. A critical observation is that entailment may only be strictly defined over word relationships that involve the same scale. For example, if it is very cold, it is at least somewhat cold. But if it is very cold, it is unclear whether it is at least sometimes cold.
Thus, in this paper we ask the following questions: 1. Are pretrained language models able to distinguish between different semantic categories of scalar adverbs? 2. Do they have implicit representations of full scales, from maximally negative to maximally positive? 3. How do word frequency and contextual factors impact model performance?

We use SpaCy (Honnibal and Montani, 2017) to extract phrases of the form 'ADV ADJ.' where there is a dependency between the adjective and the adverb. We take only phrases in which this construction occurs in final position, so that the phrases are also guaranteed to be well-formed in isolation, and we limit the context to two sentences (or one if no more are available) and 10 to 40 words. Aiming for at least 40 different adjectives to occur with each target adverb, we selected 8 distinct adverbs that express the speaker's judgment on a scale of likelihood (MODALITY), 8 that express a position on a temporal scale (FREQUENCY), and 8 of more general applicability (DEGREE). These were selected to span the full range of each scale, and hence include adverbs with negative force, i.e., hardly and never. However, the word not is reserved as a benchmark and not used as a target. Both the adverbs themselves and the ADV ADJ bigrams were selected to span the range of available frequencies to the extent possible, using Google Ngram frequencies (Lin et al., 2012). The target adverbs are listed in Table 1.
According to Paradis (1998), some of our chosen scalar adverbs are 'maximizers' (e.g., completely), which tend to occur with extreme adjectives (e.g., freezing) or limit adjectives (e.g., dead). Some are 'scalar' adverbs which combine with more stereotypically scalar adjectives (e.g., cold). However, these are tendencies rather than rules (Kamoen et al., 2011). Indeed, phrases such as very dead or completely cold can be perfectly acceptable in some contexts (e.g., I can assure you he was very dead or By that time he was completely cold). Therefore, we do not restrict our phrases to traditional scalar adjectives and include any occurrences involving the target adverbs. Example items and targets can be seen in Table 2.
These items were used as such in an MLM task. They also form the basis for the construction of items for an adverb-ranking task and an entailment task.

Tasks
Reflecting discussion in Talmor et al. (2020), Liu et al. (2021) and Jiang et al. (2022), our main tasks are zero-shot evaluations (without fine-tuning), so as to examine the representations learned from pretraining. We first evaluate the extent to which the rankings in Table 1 can be recovered from the embedding space in BERT and RoBERTa. We then look at MLM (one of the training objectives for the models) and finally entailment (a canonical logical task). In Section 4.6, we also briefly consider models which have been fine-tuned on a Natural Language Inference dataset (e.g., MNLI; Williams et al., 2018).

Ranking adverbs by scalar position
Following Soler and Apidianaki (2020), our first question is whether the rankings of the various scalar adverbs along the relevant scales are observable in the embeddings. Resources that provide scalar rankings for adverbs are scarce, and the few available, such as Taboada et al. (2011), confound scalar position with other factors. Therefore, we defined our own gold standard (cf. Table 1) on the basis of WordNet definitions.
We combined each target scalar adverb with (the same) 40 common adjectives: able, bad, big, black, clear, different, early, easy, economic, federal, free, full, good, hard, high, human, important, international, large, late, little, local, low, military, national, new, old, only, political, possible, public, real, recent, right, small, social, special, strong, white, young. We first applied both methods described in Soler and Apidianaki (2020) for assessing the scalar position of scalar adjectives. Their first method (SIM) uses a reference point, specifically the top end of each scale, and computes the cosine similarity of each target to the reference point; the similarity should decrease as we move down the scale. Their second method (DIFF) uses the difference between the maximum and minimum words on a scale to define an abstract vector of scale position; the cosine similarity of any word to this vector is taken to indicate its scale position. To adapt this method to our full-scale scalar adverbs, we take the difference between the maximal element in each semantic class (e.g., always, definitely, completely) and the respective non-negative bottom adverb (sometimes, maybe, slightly), and again compute cosine similarities for the other adverbs.
As we will see, neither of these methods performed very well for full scales of scalar adverbs. So, we also devised a third method (AdjDIFF), broadly inspired by Maillard and Clark (2015) and Socher et al.'s (2012) work on semantic composition for nouns. Given that pretrained language models create contextualised representations of each token in a sentence, we reason that the representation of an adjective modified by a scalar adverb will integrate information about the adverb compared to an unmodified adjective, and that this difference may correlate with the scalar adverb's position on the scale. Thus, we obtain embeddings for each adjective with and without the scalar adverb. We subtract the unmodified embedding from the modified embedding of the adjective to obtain an estimate of the vector for the scalar adverb. As in the other two methods, we then take the cosine similarity of each resulting vector with the same reference vector as in the second method (i.e., the difference between the top and bottom adverb of the relevant scale) and average across adjectives to obtain the final cosine similarity value.
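The AdjDIFF computation can be sketched as follows, with toy hand-made vectors standing in for the contextualised embeddings; the vectors, the shift function and its parameters are illustrative assumptions, not values from our experiments.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def adjdiff_score(modified, unmodified, ref_vector):
    """AdjDIFF: subtract each unmodified adjective embedding from its
    adverb-modified counterpart to estimate an adverb vector, score each
    estimate against the top-minus-bottom reference vector, and average."""
    sims = []
    for m, u in zip(modified, unmodified):
        diff = [a - b for a, b in zip(m, u)]
        sims.append(cosine(diff, ref_vector))
    return sum(sims) / len(sims)

# Toy 3-d "embeddings" standing in for contextualised vectors.
ref = [1.0, 0.0, 0.0]                       # top minus bottom of the scale
plain = [[0.2, 1.0, 0.5], [0.1, 0.8, 0.9]]  # unmodified adjective embeddings

def shift(embs, strength, drift):
    """Simulate a contextual shift: a stronger adverb moves the adjective
    further along the scale direction, plus some off-scale drift."""
    return [[e + strength * r + d for e, r, d in zip(emb, ref, dr)]
            for emb, dr in zip(embs, drift)]

drift = [[0.0, 0.3, -0.2], [0.0, -0.1, 0.2]]
weak = shift(plain, 0.2, drift)     # e.g., 'slightly ADJ'
strong = shift(plain, 1.0, drift)   # e.g., 'completely ADJ'

# A stronger shift along the scale yields a higher AdjDIFF score.
assert adjdiff_score(weak, plain, ref) < adjdiff_score(strong, plain, ref)
```

In the actual experiments the modified and unmodified embeddings come from BERT or RoBERTa; the sketch only shows the subtraction-and-cosine logic.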
We report the pairwise accuracy, Spearman ρ and tie-corrected Kendall's τ for RoBERTa, BERT large and BERT base.
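Pairwise accuracy compares every pair of adverbs that is not tied in the gold ranking; a minimal sketch, where the gold positions and predicted scores below are illustrative (1 = bottom of scale):

```python
from itertools import combinations

def pairwise_accuracy(gold, predicted_scores):
    """Fraction of adverb pairs whose order under predicted_scores
    matches the gold ranking; pairs tied in the gold standard are skipped."""
    correct = total = 0
    for a, b in combinations(gold, 2):
        if gold[a] == gold[b]:
            continue  # tied in the gold standard
        total += 1
        if (gold[a] < gold[b]) == (predicted_scores[a] < predicted_scores[b]):
            correct += 1
    return correct / total

gold = {"sometimes": 1, "often": 2, "always": 3}
pred = {"sometimes": 0.2, "often": 0.5, "always": 0.4}
# The often/always pair is inverted, so 2 of 3 pairs are correct.
assert abs(pairwise_accuracy(gold, pred) - 2 / 3) < 1e-9
```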
The results can be found in Table 3. We note that our method outperforms the two methods used in Soler and Apidianaki (2020), even though it yields worse performance for adverbs than what they obtained for adjectival half-scales. The existence of negative values of Spearman ρ and Kendall's τ is particularly problematic, and we note that the FREQUENCY category shows the worst performance. However, the overall accuracy of 0.64 for the BERT AdjDIFF method indicates that some information about the relations has been captured by the model. It is also interesting that the best performing model here is not the larger RoBERTa but BERT.

Masked Language Modelling
MLM is one of two training objectives for BERT, and the only training objective for RoBERTa. For BERT or RoBERTa to form good representations of the scalar adverbs, the larger context needs to contain information about which ones are most likely in any given instance, even if a good variety of them are logically possible and in principle acceptable. Accordingly, we directly evaluate the raw Masked Language Modelling outputs for the target phrases we selected. How well does each model predict a scalar adverb in a context when it is masked? Based on the discussion in Section 2, we expect this task to be extremely difficult. For most of our examples, it appears that humans would be unlikely to succeed in predicting the masked word. However, through learned attention weights, BERT and RoBERTa integrate information over a large time window. It could be the case that they perform better than expected, potentially better than humans do. Potential sources of information about the scalar adverb include collocations or selectional restrictions involving the following adjective, and rhetorical devices or idiomatic expressions involving the preceding context. Hence, we systematically explore the success of MLM in predicting a masked adverb. If MLM is successful, that means that the predictive information is present in a way that is not intuitively evident, whereas if MLM fails, that tends to suggest that predictive information is simply lacking in the text stream.
In light of the difficulty of the task, we report two measures. One is the Mean Reciprocal Rank (MRR) of the original adverb, which is high when the adverb that occurred is ranked highly among the model's predictions, even when it is not the top prediction. The Mean Reciprocal Rank for each adverb is defined as

MRR_adv = (1/N) * Σ_{i=1}^{N} 1 / rank_adv(i)

where N is the number of items for the original target adverb and rank_adv(i) is the rank of the original target adverb among the model's predictions for item i.
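A minimal implementation of this measure (the example ranks are illustrative):

```python
def mean_reciprocal_rank(ranks):
    """MRR over the ranks a model assigned to the original adverb,
    one rank per item (rank 1 = top prediction)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# If the original adverb was ranked 1st, 4th and 10th across three items:
assert abs(mean_reciprocal_rank([1, 4, 10]) - 0.45) < 1e-9
```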
Our other metric is whether the model ranked the original adverb above not. In our materials, replacement with not is always syntactically and semantically possible. However, it would reverse the meaning of the phrase. This reversal either explicitly contradicts the previous context (the case for 60% of our examples), or does not (40% of our examples). Thus, in the majority of cases, not should be ranked as less likely than any scalar adverb, with the exception of other negative polarity items such as hardly and never. We test three models in the BERT family: BERT base, BERT large and RoBERTa large. We also test GPT-2 (Radford et al., 2019), which unlike the BERT-family models is an autoregressive model and helps us to see whether the right-hand context contains relevant information. We use the pretrained cased BERT large and BERT base from the Hugging Face transformers library (Wolf et al., 2019), replacing our target scalar adverb with the [MASK] token and obtaining the logits, which we convert back into probabilities.
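Once probabilities for the candidate fillers are in hand (in our experiments, from the softmaxed logits at the masked position), this second measure reduces to a simple comparison; the probabilities below are made-up numbers for illustration:

```python
def outranks_not(probs, original_adverb):
    """True if the model assigns the original adverb a higher
    masked-position probability than explicit negation ('not')."""
    return probs[original_adverb] > probs["not"]

# Hypothetical MLM output for a masked adverb slot:
probs = {"very": 0.31, "not": 0.22, "really": 0.12, "slightly": 0.04}
assert outranks_not(probs, "very")
assert not outranks_not(probs, "slightly")
```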
We use neutralised versions of the sentences, e.g., 'is ADV ADJ.', as a baseline for predictions. This provides the BERT-family models with a syntactic cue plus any selectional biases from the ADJ.

MLM Results
Results can be found in Table 4. All models perform extremely poorly in the neutral context, indicating that adjectives alone are not sufficient to predict adverbs. (The GPT-2 neutral condition of course has no success, since GPT-2 does not use right-hand context.) The results for the full context are better. Both BERT large and BERT base get a significant boost from the full context, both in upranking the original adverb (MRR doubling for both models) and in ranking the original adverb above negation. RoBERTa performs best overall. The MODALITY category gets the highest boost from context, from 0.02 to 0.57; this may be due in part to the fact that English lacks a negative item in this category. Error analysis shows that BERT still yields negation as the top prediction or among the top predictions even in cases where the context makes it unlikely (e.g., not is the top prediction for the first two examples in Table 2).
We construct confusion matrices between the original target adverb and the top output prediction for each example, for each model and context. We select the first of our target adverbs in the top 10 outputs, or the category 'other' when none of the top 10 predictions appears in our list. The heatmaps with context can be seen in Figure 1 (the full set of heatmaps, including outcomes for the neutral context, can be found in Appendix Figure 2). While there is some indication of ability to predict the original adverb from BERT (i.e., the faintly lit-up diagonal), it is clear that the decision is strongly driven by prior frequency effects, with not and very topping the predictions for all targets. RoBERTa performs better, as evidenced by the more strongly lit-up diagonal, but frequency effects still dominate (the vertical lines for items such as very). The figure is laid out so that within-category confusions would show up in a block-diagonal pattern. No such pattern exists, indicating that scalar adverbs within the same semantic category do not emerge as particularly similar.
The fact that the models enjoy some success when provided with the left-hand context of the target indicates that the left-hand context, unlike the right-hand context, contains information about which scalar adverb is more likely in which instance. However, because of the naturalistic and varied nature of our examples, it is uncertain whether this success derives from learning logical or distributional relations and would robustly extend to examples with minimal context.
Because of this, we create another synthetic dataset in order to assess performance in a more controlled way.

Scalar entailment task
The adverb rankings in Table 1 can readily be translated into entailments. Evaluating putative entailments is an established test of logical reasoning: It is always cold entails It is sometimes cold, but the reverse is not true. If the models have reliably learned the logical relations between scalar adverbs during pretraining, they should rank completions which create correct entailments higher than completions which create contradictions. We set up an entailment task as an MLM task in two conditions, which we illustrate with example items constructed from the ADV ADJ combination often special. For the BELOW condition, we create items where we expect an adverb which is below the premise adverb on the relevant scale, e.g., If it is often special, then it is at least [MASK] special (sometimes, occasionally, etc.). For the ABOVE condition, we expect a completion which is above the premise adverb, e.g., If it is [MASK] special, then it is at least sometimes special (often, usually, etc.). We craft items using eight different templates for each condition (varying the order of premise and mask as well as the subordinating conjunction). These can be found in the Appendix along with detailed results for each template.
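One template per condition can be sketched as follows; the template wording here simply follows the examples above, while the remaining templates vary the order of premise and mask and the subordinating conjunction:

```python
# Illustrative BELOW/ABOVE templates following the examples in the text.
BELOW = "If it is {premise} {adj}, then it is at least [MASK] {adj}."
ABOVE = "If it is [MASK] {adj}, then it is at least {premise} {adj}."

def make_item(template, premise_adverb, adjective):
    """Fill one template with a premise adverb and an adjective."""
    return template.format(premise=premise_adverb, adj=adjective)

assert make_item(BELOW, "often", "special") == \
    "If it is often special, then it is at least [MASK] special."
```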
We omit the scalar adverbs which are technically negations (hardly, never) and omit 'bottom' scalar adverbs (sometimes, maybe, slightly) for the BELOW condition and top scalar adverbs (always, definitely, completely) for the ABOVE condition, since we do not have correct answers available for these.
We used 160 adjectives systematically varied in frequency. From our Reddit data, we selected adjectives in the low frequency range (log frequency -18 to -14), medium frequency range (log frequency -14 to -10) and high frequency range (log frequency -10 to -6). We use the recent wordfreq Python library for this purpose (Speer, 2022), which is sourced from the Exquisite Corpus project and compiles multilingual data from 8 domains. In addition, we selected 40 pseudowords as adjectives from the highest-ranked items in Needle et al. (2022), under the constraint that they are not compounds of real words, so that the pretrained models' WordPiece tokenization does not introduce any previous information. We combine these 160 adjectives with our target adverbs for each template and condition to create a dataset of 40960 sentences, for which we collect MLM completions from BERT large and RoBERTa.

Entailment results
Just as we did for constructing the confusion matrices in our first task, we select the first answer from the top 10 completions of the model which is in our list (including the negative items not, hardly and never), and the category 'other' if no completion in the top 10 is found in the relevant category. When computing accuracies, we do not count trivial answers which merely repeat the premise adverb (e.g., If it is sometimes strong, then it is sometimes strong), or 'other' answers which do not pertain to the target semantic category (e.g., mostly when the category is temporal). However, we do report trivial answer percentages separately. We verify empirically that the chance performance on our dataset is 50% by running a model which randomly picks an adverb in the relevant category (result: 0.48 accuracy and 0.13 trivial answers).
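The answer-selection step can be sketched as follows (the candidate lists are illustrative):

```python
def first_relevant(top10, targets):
    """Return the first of the model's top-10 completions that is in our
    adverb list (negations included), or 'other' if none is."""
    for token in top10:
        if token in targets:
            return token
    return "other"

targets = {"sometimes", "often", "always", "not", "hardly", "never"}
assert first_relevant(["the", "always", "often"], targets) == "always"
assert first_relevant(["very", "quite"], targets) == "other"
```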
The results when taking into account negative answers are very poor. The models output a high percentage of negations (especially not), even though negations constitute logical contradictions for all sentences in the entailment dataset.
To get a more nuanced picture of the models' behaviour and support comparisons with the MLM results, we also build heatmaps without taking negative answers into account, and calculate accuracies without negations. Results (with and without negations) can be seen in Table 5. Both sets of heatmaps (including negative answers) can be found in Figures 3 and 4 in the Appendix. When choosing the top relevant answer excluding negations, the results improve drastically, to near ceiling. However, the models most likely benefit from biases in both conditions: in the ABOVE condition, the most frequent items (always, actually, very; see adverb frequencies in the Appendix) constitute correct answers in a majority of cases, and in the BELOW condition all templates have a textual hint (at least/at most) which strongly suggests an item outside the top of the scale (?at least/at most always/completely/definitely). The benefits from the bias towards high-frequency top-of-scale adverbs are probably stronger than those from the textual hints, which explains why the models perform worse in the BELOW condition.
Adjective frequency does not seem to affect performance, which is in fact slightly better for low-frequency adjectives and pseudowords. Indeed, performance in RoBERTa is better overall for less frequent items and drops less for the less frequent and pseudo adjectives (perhaps indicating that the model is less 'distracted' by noise from frequent items). This provides further evidence that the models are not memorizing ADV-ADJ combinations. This observation is strengthened by adverb frequency effects which prevail across scales (i.e., vertical lines in the heatmaps, e.g., for slightly and really) in both BERT large and RoBERTa. The rate of trivial answers in both models also appears to be far above what would be expected from humans (although this remains to be tested by collecting human judgments).
To summarize, both BERT large and RoBERTa show very poor ability to distinguish between non-negative scalar adverbs and negation. The models perform well if we consider first completions excluding negations. However, they most likely benefit from frequency biases, which makes it doubtful whether they have learned a separate logical representation of the adverbs' scalar properties. The models also output a high number of trivial, uninformative completions and seem affected by noise associated with frequent adjectives. Finally, differing performance on the ABOVE and BELOW conditions, which use the same constructions and are logically equivalent, indicates that neither model has achieved a general grasp of the underlying logic.

Further probes
GPT-3 is widely acknowledged to perform better than GPT-2. We probe GPT-3 (text-davinci-002) (Brown et al., 2020) on a sample of 5120 sentences from our dataset, with a prompt mimicking the MLM task, to obtain ten completions for the MASK token. The results can be seen in Table 6, along with the results from RoBERTa on the same sample. While GPT-3 outputs a much lower number of trivial and negative completions, it seems to be even more biased towards high-frequency top-of-the-scale answers, causing it to perform poorly, especially in the BELOW condition.
To evaluate whether models which have been fine-tuned on a Natural Language Inference dataset perform better, we also adapt our dataset to the MNLI format (Williams et al., 2018). MNLI involves a sentence-pair classification task in which crowdworkers were asked to label the relation between two ordered sentences as entailment, neutral or contradiction. Entailment in the MNLI dataset is related to entailment in our task; however, it is worth noting that for NLI datasets the term entailment is more loosely defined than its strict logical counterpart. It is more akin to 'sentence B could reliably be inferred from sentence A'. We adapted our items to the MNLI format by replacing the [MASK] token at random with an adverb creating a correct entailment for half of our items and an incorrect one for the remaining items (e.g., 'It is always cold. It is sometimes cold.' vs. 'It is sometimes cold. It is always cold.'). We obtain predictions from MNLI fine-tuned BERT large and RoBERTa models.
The results can be found in Table 7. As can be seen, the models perform near chance, indicating that training on a dataset containing broadly defined inferences such as MNLI does not improve performance on a strict entailment task involving scalar adverbs.

Conclusion
The goal of this paper was to examine how well pretrained language models such as BERT and RoBERTa represent and process full scales of scalar adverbs in the absence of any task-specific fine-tuning. We used naturalistic data from Reddit as well as constructed sentences in order to explore the language models' ability to predict different types of scalar adverbs in context, and to distinguish them from negation. The models achieved some success when a left context of up to 40 words was available. However, we note many shortcomings in relation to human performance: weak differentiation amongst the semantic classes of adverbs, poor ability to discriminate scalar adverbs from negations even in contexts where a negation would create a contradiction, strong effects of adverb frequencies, and lack of generalisation across two logically equivalent entailment constructions.

Limitations
Scale: While our list of adverbs was carefully curated to include different semantic categories, full scales (including negations) and downsizer adverbs (e.g., slightly), unlike in previous works, it is a restricted sample of only 24 adverbs. While we believe this is a representative list, it is by no means exhaustive, and the conclusions drawn in this paper must be confined to the semantic categories explored. In addition, we cannot exclude the possibility that experiments using a larger list of adverbs would produce different results.
Models: While we tried to explore the predictions from different types of pretrained models (i.e., GPT and BERT), we acknowledge that we did not run an extensive study of models from different families. This is in part because these are the most commonly used models in applications, but also because our study is qualitative and we were mostly interested in comparing the models' outputs with and without context, and comparing performance between our semantic categories, rather than between different models. We also wished to focus on open-source models for which we could extract embeddings to explore a potential subspace for scalar properties.
Gold standard: There are few resources providing gold-standard labels for the scalar positions of scalar adverbs in general, especially ones including different semantic categories and downsizers as well as maximizers. This limited our available choices of scalar adverbs to investigate. We provided the gold-standard labels for the list of adverbs based on information in WordNet and claims in the research literature, excluding adverbs whose semantics appeared unclear. While these rankings are informed by our best knowledge of semantics as experienced linguists, they were provided by a few researchers rather than by gathering judgments from many crowdworkers as in other studies.

A Heatmaps
Figure 1: Heatmap of confusion matrices per scalar adverb for each model with context (items are grouped by semantic category). In the interest of space, we only show results for BERT large and RoBERTa.

Figure 2: Heatmap of confusion matrices per intensifier for each model and context in the MLM task (items are grouped by semantic category). In the interest of space, we only show results for BERT large and RoBERTa.

Table 2 :
Example target phrases and sentences.

Table 4 :
Accuracies ('acc', i.e., the number of times the original adverb was ranked above not) and Mean Reciprocal Rank (MRR) for each adverb and semantic category.

Table 5 :
Results for the scalar entailment dataset (BERT large and RoBERTa). pseudo = sentences with pseudowords from Needle et al. (2022), low = adjectives with log frequencies -18 to -14, med = adjectives with log frequencies -14 to -10, high = adjectives with log frequencies -10 to -6, (no neg) = not taking negations into account as answers, acc. = accuracy, triv. = number of trivial answers (e.g., If it's sometimes ADJ, it's sometimes ADJ, which are not counted towards accuracies), below = expects an item below on the relevant scale, above = expects an item above on the relevant scale (best in bold).


Table 6 :
Results for the sample scalar entailment dataset. pseudo = sentences with pseudowords from Needle et al. (2022), low = adjectives with log frequencies -18 to -14, med = adjectives with log frequencies -14 to -10, high = adjectives with log frequencies -10 to -6, (no neg) = not taking negations into account as answers, acc. = accuracy, triv. = number of trivial answers (e.g., If it's sometimes ADJ, it's sometimes ADJ, which are not counted towards accuracies), below = expects an item below on the relevant scale, above = expects an item above on the relevant scale (best in bold).

Table 7 :
Results for scalar entailment with models fine tuned on MNLI

Table 8 :
Relative frequencies of our scalar adverbs

Table 9 :
Results for scalar entailment per template and example template (with negations) (best in bold).

Table 10 :
Results for scalar entailment per template and example template (without negations) (best in bold).