Numeric Magnitude Comparison Effects in Large Language Models

Large Language Models (LLMs) do not differentially represent numbers, which are pervasive in text. In contrast, neuroscience research has identified distinct neural representations for numbers and words. In this work, we investigate how well popular LLMs capture the magnitudes of numbers (e.g., that $4<5$) from a behavioral lens. Prior research on the representational capabilities of LLMs evaluates whether they show human-level performance, for instance, high overall accuracy on standard benchmarks. Here, we ask a different question, one inspired by cognitive science: How closely do the number representations of LLMs correspond to those of human language users, who typically demonstrate the distance, size, and ratio effects? We depend on a linking hypothesis to map the similarities among model embeddings of number words and digits to human response times. The results reveal surprisingly human-like representations across language models of different architectures, despite the absence of the neural circuitry that directly supports these representations in the human brain. This research shows the utility of understanding LLMs using behavioral benchmarks and points the way to future work on the number representations of LLMs and their cognitive plausibility.


Introduction
Humans use symbols, number words such as "three" and digits such as "3", to quantify the world. How humans understand these symbols has been the subject of cognitive science research for half a century. The dominant theory is that people understand number symbols by mapping them to mental representations, specifically magnitude representations (Moyer and Landauer, 1967). This is true for both number words (e.g., "three") and digits (e.g., "3"). These magnitude representations are organized as a "mental number line" (MNL), with numbers mapped to points on the line as shown in Figure 1d. Cognitive science research has revealed that this representation is present in the minds of young children (Ansari et al., 2005) and even nonhuman primates (Nieder and Miller, 2003). Most of this research has been conducted with numbers in the range 1-9, in part because corpus studies have shown that 0 belongs to a different distribution (Dehaene and Mehler, 1992) and, in part, because larger numbers require parsing place-value notation (Nuerk et al., 2001), a cognitive process beyond the scope of the current study.
Evidence for this proposal comes from magnitude comparison tasks in which people are asked to compare two numbers (e.g., 3 vs. 7) and judge which one is greater (or lesser). Humans have consistently exhibited three effects that suggest recruitment of magnitude representations to understand numbers: the distance effect, the size effect, and the ratio effect (Moyer and Landauer, 1967; Merkley and Ansari, 2010). We review the experimental evidence for these effects, shown in Figure 1, and test for them in LLMs. Our behavioral benchmarking approach shifts the focus from what abilities LLMs have in an absolute sense to whether they successfully mimic human performance characteristics. This approach can help differentiate between human tendencies captured by models and model behaviors due to training strategies. Thus, the current study bridges Natural Language Processing (NLP), computational linguistics, and cognitive science.

Effects of Magnitude Representations
Physical quantities in the world, such as the brightness of a light or the loudness of a sound, are encoded as logarithmically scaled magnitude representations (Fechner, 1860). Research conducted with human participants and non-human species has revealed that they recruit many of the same brain regions, such as the intraparietal sulcus, to determine the magnitude of symbolic numbers (Billock and Tsou, 2011; Nieder and Dehaene, 2009). Three primary magnitude representation effects have been found using the numerical comparison task in studies of humans. First, comparisons show a distance effect: The greater the distance |x − y| between the numbers x and y, the faster the comparison (Moyer and Landauer, 1967). Thus, people compare 1 vs. 9 faster than 1 vs. 2. This is shown in abstract form in Figure 1a. This effect can be explained by positing that people possess an MNL. When comparing two numbers, they first locate each number on this representation, determine which one is "to the right", and choose that number as the greater one. Thus, the farther the distance between the two points, the easier (and thus faster) the judgment.
Second, comparisons show a size effect: Given two comparisons of the same distance (i.e., of the same value of |x − y|), the smaller the numbers, the faster the comparison (Parkman, 1971). For example, 1 vs. 2 and 8 vs. 9 both have the same distance (i.e., |x − y| = 1), but the former involves smaller numbers and is therefore the easier (i.e., faster) judgment. The size effect is depicted in abstract form in Figure 1b. This effect also references the MNL, but a modified version where the points are logarithmically compressed, i.e., the distance from 1 to x is proportional to log(x); see Figure 1d. To investigate whether a logarithmically compressed number line is also present in LLMs, we use multidimensional scaling (Ding, 2018) on the cosine distances between number embeddings.
Third, comparisons show a ratio effect: The time to compare two numbers x and y is a decreasing function of the ratio of the larger number over the smaller number, i.e., max(x, y)/min(x, y) (Halberda et al., 2008). This function is nonlinear, as depicted in abstract form in Figure 1c. Here, we assume that this function is a negative exponential, though other functional forms have been proposed in the cognitive science literature. The ratio effect can also be explained by the logarithmically compressed MNL depicted in Figure 1d.
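These three difficulty predictors are simple functions of the pair being compared. The following sketch is our own illustration (not code from the study); the function names are ours:

```python
# Illustrative predictors of comparison difficulty for numbers 1-9.
def distance(x, y):   # distance effect: larger |x - y| means an easier, faster comparison
    return abs(x - y)

def size(x, y):       # size effect: a smaller min(x, y) means an easier comparison
    return min(x, y)

def ratio(x, y):      # ratio effect: a larger max/min means an easier comparison
    return max(x, y) / min(x, y)

# 1 vs. 9 should be easier than 1 vs. 2: greater distance, equal size, greater ratio.
print(distance(1, 9), distance(1, 2))  # 8 1
print(ratio(1, 9), ratio(1, 2))        # 9.0 2.0
```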
These three effects (distance, size, and ratio) have been replicated numerous times in studies of human adults and children, non-human primates, and many other species (Cantlon, 2012; Cohen Kadosh et al., 2008). The MNL model in Figure 1d accounts for these effects (and many others in the mathematical cognition literature). Here, we use LLMs to evaluate a novel scientific hypothesis: that the MNL representation of the human mind is latent in the statistical structure of the linguistic environment, and thus learnable. If so, there is less need to posit pre-programmed neural circuitry to explain magnitude effects.

LLMs and Behavioral Benchmarks
Modern NLP models are pre-trained on large corpora of texts from diverse sources such as Wikipedia (Wikipedia contributors, 2004) and the BookCorpus (Zhu et al., 2015). LLMs like BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and GPT-2 (Radford et al., 2019) learn contextual semantic vector representations of words.
These models have achieved remarkable success on NLP benchmarks (Wang et al., 2018). They can perform as well as humans on a number of language tests such as semantic verification (Bhatia and Richie, 2022) and semantic disambiguation (Lake and Murphy, 2021).
Most benchmarks are designed to measure the absolute performance of LLMs, with higher accuracy signaling "better" models. Human or superhuman performance is marked by exceeding certain thresholds. Here, we ask not whether LLMs can perform well or even exceed human performance at tasks, but whether they show the same performance characteristics as humans while accomplishing the same tasks. We call these behavioral benchmarks. The notion of behavioral benchmarks requires moving beyond accuracy (e.g., scores) as the dominant measure of LLM performance.
As a test case, we use the distance, size, and ratio effects as behavioral benchmarks to determine whether LLMs understand numbers as humans do, using magnitude representations. This requires a linking hypothesis to map measures of human performance to indices of model performance. Here, we map human response times on numerical comparison tasks to similarity computations on number word embeddings.

Research Questions
The current study investigates the number representations of LLMs and their alignment with the human MNL. It addresses five research questions:
1. Which LLMs, if any, capture the distance, size, and ratio effects exhibited by humans?
2. How do different layers of LLMs vary in exhibiting these effects?
3. How do model behaviors change when using larger variants (more parameters) of the same architecture?
4. Do the models show implicit numeration ("four" = "4"), i.e., do they exhibit these effects equally for all number symbol types or more for some types (e.g., digits) than others (e.g., number words)?
5. Is the MNL representation depicted in Figure 1d latent in the representations of the models?

Related Work
Research on the numerical abilities of LLMs focuses on several aspects of mathematical reasoning (Thawani et al., 2021), such as magnitude comparison, numeration (Naik et al., 2019; Wallace et al., 2019), arithmetic word problems (Burns et al., 2021; Amini et al., 2019), exact facts (Lin et al., 2020), and measurement estimation (Zhang et al., 2020). The goal is to improve performance on application-driven tasks that require numerical skills. Research in this area typically attempts to (1) understand the numerical capabilities of pre-trained models and (2) propose new architectures that improve numerical cognition abilities (Geva et al., 2020; Dua et al., 2019).
Our work also focuses on the first research direction: probing the numerical capabilities of pre-trained models. Prior research by Wallace et al. (2019) judges the numerical reasoning of various contextual and non-contextual models using different tests (e.g., finding the maximum number in a list, finding the sum of two numbers from their word embeddings, decoding the original number from its embedding). These tasks have been presented as evaluation criteria for understanding the numerical capabilities of models. Spithourakis and Riedel (2018) change model architectures to treat numbers as distinct from words; using perplexity as a proxy for numerical ability, they argue that doing so reduces model perplexity in neural machine translation tasks. Other work finds numerical capabilities through building QA benchmarks for discrete reasoning (Dua et al., 2019). Most research in this direction casts different tasks as proxies for the numerical abilities of NLP systems (Weiss et al., 2018; Dua et al., 2019; Spithourakis and Riedel, 2018; Wallace et al., 2019; Burns et al., 2021; Amini et al., 2019).
An alternative approach by Naik et al. (2019) tests multiple non-contextual, task-agnostic embedding generation techniques to identify failures in models' abilities to capture the magnitude and numeration effects of numbers. Using a systematic foundation in cognitive science research, we build upon their work in two ways: we (1) use contextual embeddings spanning a wide variety of pre-training strategies, and (2) evaluate models by comparing their behavior to humans. Our work looks at numbers in an abstract sense and is relevant for the grounding problem studied in artificial intelligence and cognitive science (Harnad, 2023).
The current study addresses this gap. We propose a general methodology for mapping human response times to similarities computed over LLM embeddings. We test for the three primary magnitude representation effects described in section 1.1.

Linking Hypothesis
In studies with human participants, the distance, size, and ratio effects are measured using reaction time. Each effect depends on the assumption that when judging which of two numbers x and y is greater is relatively easy, humans are relatively fast, and when it is relatively difficult, they are relatively slow. The ease or difficulty of the comparison is a function of x and y: |x − y| for the distance effect, min(x, y) for the size effect, and max(x, y)/min(x, y) for the ratio effect. LLMs do not naturally make reaction time predictions. Thus, we require a linking hypothesis to estimate the relative ease or difficulty of comparisons for LLMs. Here we adopt the simple assumption that the greater the similarity of two number representations in an LLM, the longer it takes to discriminate them, i.e., to judge which one is greater (or lesser).
We calculate the similarity of two numbers based on the similarity of their vector representations. Specifically, the representation of a number for a given layer of a given model is the vector of activations across its units. There are many similarity metrics for vector representations (Wang and Dong, 2020): Manhattan, Euclidean, cosine, dot product, etc. Here, we choose a standard metric in distributional semantics: the cosine of the angle between the vectors (Richie and Bhatia, 2021). This reasoning connects an index of model function (i.e., the similarity of the vector representations of two numbers) to a human behavioral measure (i.e., reaction time): the more similar two representations are, the less discriminable they are from each other, and thus the longer the reaction time to select one over the other.
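As a minimal sketch of this metric (the function name and toy vectors are ours, not the study's code):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d "embeddings": nearly parallel vectors are highly similar and, under
# the linking hypothesis, would predict a slow (difficult) comparison.
print(cosine_similarity([1.0, 2.0, 3.0], [1.1, 2.1, 2.9]))
```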

Materials
For these experiments, we utilized three formats for number representations in LLMs: lowercase number words, mixed-case number words (i.e., the first letter is capitalized), and digits. These formats enable us to explore variations in input tokens and understand numeration in models. Below are examples of the three input types:
• "one", "two", "three", "four" ... "nine"
• "One", "Two", "Three", "Four" ... "Nine"
• "1", "2", "3", "4" ... "9"
As noted in the Introduction, prior studies of the distance, size, and ratio effects in humans have largely focused on numbers ranging from 1 to 9. Our input types are not affected by tokenization methods, as the models under consideration represent each input as a separate token.

Large Language Models -Design Choices
Modern NLP models are pre-trained on a large amount of unlabeled textual data from a diverse set of sources. This enables LLMs to learn contextual semantic vector representations of words. We experiment on these vectors to evaluate how one specific dimension of human knowledge, number sense, is captured in different model architectures.
We use popular large language models from Huggingface's Transformers library (Wolf et al., 2020) to obtain vector representations of numbers in different formats. Following the work of Min et al. (2021) to determine popular model architectures, we select models from three classes of architectural design: encoder models (e.g., BERT (Devlin et al., 2018)), auto-regressive models (e.g., GPT-2 (Radford et al., 2019)), and encoder-decoder models (e.g., T5 (Raffel et al., 2019)). The final list of models is provided in Table 1.
Operationalization: We investigate the three number magnitude effects as captured in the representations of each layer of the six models for the three number formats. For these experiments, we consider only the hidden layer outputs for the tokens corresponding to the input number tokens. We ignore the special prefix and suffix tokens of models (e.g., the [CLS] token in BERT) for uniformity across architectures. For the T5-base model, we use only the encoder to obtain model embeddings. All models tested have a similar number of parameters (around 110-140 million). For our studies, we arbitrarily choose the more popular BERT uncased variant as opposed to the cased version; we compare the two variants in Appendix section A.2, which shows similar behaviors. Model size variations for the same architecture are considered in Appendix section A.1 to show the impact of model size on the three effects.

The Distance Effect
Recall that the distance effect is that people are slower (i.e., find it more difficult) to compare numbers the closer they are to each other on the MNL. We use the pipeline depicted in Figure 1 to investigate whether LLM representations are more similar to each other if the numbers are closer on the MNL.
Evaluation of the distance effect in LLMs is done by fitting a straight line (a + bx) to the cosine similarity vs. distance plot. We first perform two operations on these cosine similarities: (1) We average the similarities across each distance (e.g., the point at distance 1 on the x-axis represents the average similarity of 1 vs. 2, 2 vs. 3, ..., 8 vs. 9). (2) We normalize the similarities to be in the range [0, 1]. These decisions allow relative comparisons across different model architectures, which is not possible using the raw cosine similarities of each LLM. To illustrate model performance, the distance effects for the best-performing layer (in terms of R² values) of BART are shown in Figure 2 for the three number formats. All of the models show strong distance effects for all layers, as shown in Table 2, and for all number formats, as shown in Table 3. Interestingly, LLMs are less likely to reveal the distance effect as layer depth increases (Table 2). For example, layer one shows the strongest distance effect while layer twelve is the least representative of the distance effect. With respect to number format, passing digits as inputs tended to produce stronger distance effects than passing number words (Table 3); this pattern was present for four of the six LLMs (i.e., all but T5 and BERT).
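The averaging, normalization, and line-fitting steps can be sketched as follows. This is our reconstruction under stated assumptions (the function names are ours, and `sim(x, y)` stands in for the cosine similarity of two number embeddings from a given layer):

```python
import numpy as np
from itertools import combinations

def distance_effect_r2(sim):
    """R^2 of a line a + b*x fit to mean normalized similarity vs. distance |x - y|."""
    pairs = list(combinations(range(1, 10), 2))        # all comparisons of 1-9
    dists = sorted({abs(x - y) for x, y in pairs})     # distances 1..8
    # Average similarities across all pairs sharing a distance.
    means = np.array([np.mean([sim(x, y) for x, y in pairs if abs(x - y) == d])
                      for d in dists])
    # Normalize the averages to [0, 1] for cross-architecture comparability.
    means = (means - means.min()) / (means.max() - means.min())
    b, a = np.polyfit(dists, means, 1)                 # slope b, intercept a
    resid = means - (a + b * np.array(dists))
    return 1 - np.sum(resid ** 2) / np.sum((means - means.mean()) ** 2)

# A toy similarity that decays linearly with distance yields a near-perfect fit.
print(distance_effect_r2(lambda x, y: 1 - 0.1 * abs(x - y)))
```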

The Size Effect
The size effect holds for comparisons of the same distance (e.g., for a distance of 1, these include 1 vs. 2, 2 vs. 3, ..., 8 vs. 9). Among these comparisons, those involving larger numbers (e.g., 8 vs. 9) are made more slowly (i.e., people find them more difficult) than those involving smaller numbers (e.g., 1 vs. 2). That larger numbers are harder to differentiate than smaller numbers aligns with the logarithmically compressed MNL depicted in Figure 1d. This study evaluates whether a given LLM shows a size effect on a given layer for numbers of a given format by plotting the normalized cosine similarities against the size of the comparison, defined as the minimum of the two numbers being compared. For each minimum value (points on the x-axis), we average the similarities of all comparisons to form a single point (vertical compression). We then fit a straight line (a + bx) to the vertically compressed averages (blue line in Figure 3) to obtain the R² values (scores). To illustrate model performance, the size effects for the best-performing layer of the BERT-uncased model (in terms of R² values) are shown in Figure 3. Similar to the results for the distance effect, the high R² values indicate a human-like size effect.
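The size-effect fit differs from the distance-effect fit only in how comparisons are grouped: by min(x, y) rather than |x − y|. A sketch under the same assumptions (our names; `sim(x, y)` again stands in for embedding cosine similarity):

```python
import numpy as np
from itertools import combinations

def size_effect_r2(sim):
    """R^2 of a line fit to mean normalized similarity vs. comparison size min(x, y)."""
    pairs = list(combinations(range(1, 10), 2))
    sizes = sorted({min(x, y) for x, y in pairs})      # sizes 1..8
    # "Vertical compression": one mean per minimum value.
    means = np.array([np.mean([sim(x, y) for x, y in pairs if min(x, y) == m])
                      for m in sizes])
    means = (means - means.min()) / (means.max() - means.min())  # normalize to [0, 1]
    b, a = np.polyfit(sizes, means, 1)
    resid = means - (a + b * np.array(sizes))
    return 1 - np.sum(resid ** 2) / np.sum((means - means.mean()) ** 2)
```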
Interestingly, Table 4 generally shows an increasing trend in the layer-wise capability of capturing the size effect across the six LLMs. This is opposite to the trend observed across layers for the distance effect. Table 5 shows that using digits as the input values yields significantly better R² values than the other number formats. In fact, this is the only number format for which the models produce strong size effects. However, the vertical compression of points fails to capture the spread of points across the y-axis for each point on the x-axis. This spread, a limitation of the size effect analysis, is captured in the ratio effect (section 4.3).

The Ratio Effect
The ratio effect in humans can be thought of as simultaneously capturing both the distance and size effects. Behaviorally, the time to compare x vs. y is a decreasing function of the ratio of the larger number over the smaller number, i.e., of max(x, y)/min(x, y). In fact, the function is nonlinear, as depicted in Figure 1c. For the LLMs, we plot the normalized cosine similarity vs. max(x, y)/min(x, y). To each plot, we fit the negative exponential function a·e^(−bx) + c and evaluate the resulting R². To illustrate model performance, Figure 4 shows the ratio effects for the best-fitting layer of the BART model for the three number formats. As observed with the distance and size effects, the high R² values of the LLMs indicate a human-like ratio effect in the models.
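The negative exponential fit can be sketched with SciPy's `curve_fit` (our choice of fitting routine; the paper does not name one). As before, `sim(x, y)` is a stand-in for embedding cosine similarity:

```python
import numpy as np
from itertools import combinations
from scipy.optimize import curve_fit

def ratio_effect_r2(sim):
    """R^2 of a negative exponential a*exp(-b*r) + c fit to normalized
    similarity vs. the ratio r = max(x, y) / min(x, y)."""
    pairs = list(combinations(range(1, 10), 2))
    r = np.array([max(x, y) / min(x, y) for x, y in pairs])
    s = np.array([sim(x, y) for x, y in pairs], dtype=float)
    s = (s - s.min()) / (s.max() - s.min())            # normalize to [0, 1]
    neg_exp = lambda x, a, b, c: a * np.exp(-b * x) + c
    (a, b, c), _ = curve_fit(neg_exp, r, s, p0=(1.0, 1.0, 0.0), maxfev=10000)
    resid = s - neg_exp(r, a, b, c)
    return 1 - np.sum(resid ** 2) / np.sum((s - s.mean()) ** 2)
```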

Multidimensional Scaling
Along with the three magnitude effects, we also investigate whether the number representations of LLMs are consistent with the human MNL. To do so, we utilize multidimensional scaling (MDS; Borg and Groenen, 2005; Ding, 2018). MDS offers a method for recovering the latent structure in the matrix of cosine (dis)similarities between the vector representations of all pairs of numbers (for a given LLM, layer, and number format). It arranges each number in a space of N dimensions such that the distance between each pair of points is consistent with the cosine dissimilarity between their vector representations. We fix N = 1 to recover the latent MNL representation for each LLM, layer, and number format. For each solution, we anchor the point for "1" to the left side and evaluate whether the resulting visualization approximates the log-compressed MNL shown in Figure 1d. To quantify this approximation, we calculate the correlation between the positions of the numbers 1 to 9 in the MDS solution and the expected values (log(1) to log(9)) of the human MNL; see Table 8. All inputs have similar correlation values. Surprisingly, GPT-2 with digits as the number format (and averaged across all layers) shows a considerably higher correlation with the log-compressed MNL than all other models and number formats. The average correlation between latent model number lines and the log-compressed MNL decreases over the 12 layers; see Table 9.
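The one-dimensional scaling step can be sketched with classical (Torgerson) MDS in plain NumPy. This is our reconstruction, not the paper's code, and the sanity check below uses a synthetic log-compressed line rather than real model dissimilarities:

```python
import numpy as np

def mds_1d(dissim):
    """Classical MDS: 1-D coordinates from a symmetric dissimilarity matrix,
    via double centering and the top eigenvector of the resulting Gram matrix."""
    D2 = np.asarray(dissim, dtype=float) ** 2
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    B = -0.5 * J @ D2 @ J
    w, v = np.linalg.eigh(B)                           # eigenvalues in ascending order
    coords = v[:, -1] * np.sqrt(max(w[-1], 0.0))
    if coords[0] > coords[-1]:                         # anchor the point for "1" to the left
        coords = -coords
    return coords

# Sanity check: recover a synthetic log-compressed number line for 1-9.
pos = np.log(np.arange(1, 10))
dissim = np.abs(pos[:, None] - pos[None, :])
corr = np.corrcoef(mds_1d(dissim), pos)[0, 1]
```

Applied to real model data, `dissim` would be the matrix of cosine dissimilarities between number embeddings, and `corr` the kind of value reported in Table 8.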
We visualize the latent number line of GPT-2 by averaging the cosine dissimilarity matrix across layers and number formats, submitting this to MDS, and requesting a one-dimensional solution; see Figure 5. This representation shows some evidence of log compression, though with a few exceptions. One obvious exception is the displacement of 2 to the right, away from 1. Another is the displacement of 9 far to the right of 8.
To better understand whether this is a statistical artifact of GPT-2 or a more general difference between number understanding in humans versus LLMs, we perform a residual analysis comparing positions on the model's number line to those on the human MNL. We choose the digits number format, estimate the latent number line representation averaged across the layers of each model, and compute the residual between the position of each number in this representation and its position on the human MNL. This analysis is presented in Table 10. For 1, all models show a residual value of less than 0.03. This makes sense given our decision to anchor the latent number lines to 1 on the left side. The largest residuals are for 2 and 9, consistent with the anomalies noticed for the GPT-2 solution in Figure 5. These anomalies are a target for future research. We note here that 2 is often privileged even in languages such as Piraha and Mundurucu that have very limited number word inventories (Gordon, 2004; Pica et al., 2004). Further note that 9 has special significance as a "bargain price numeral" in many cultures, a fact that is often linguistically marked (Pollmann and Jansen, 1996).

Ablation studies: Base vs Large Model Variants
We investigate changes in model behaviors when increasing the number of parameters for the same architectures. We use the larger variants of each of the LLMs listed in Table 1. The detailed tabular results are presented in Appendix section A.1; see Tables 11, 12, and 13. Here, we summarize key takeaways from the ablation studies:
• The distance and ratio effects of the large variants align with human performance characteristics. Similar to the results for the base variants, the size effect is only observed when the input type is digits.
• We observe the same decreasing layer-wise trend in capturing the distance effect, the ratio effect, and the MDS correlation values in the large variants as in the base variants. The increasing layer-wise trend for the size effect is not observed in the large variants.
• Residual analysis shows high deviation for the numbers "2", "5", and "9", in line with our observations for the base variants.

Conclusion
This paper investigates the performance characteristics of various LLMs across numerous configurations, looking for three number-magnitude comparison effects: distance, size, and ratio. Our results show that LLMs exhibit human-like distance and ratio effects across number formats. The size effect is also observed among models for the digit number format, but not for the other number formats, showing that LLMs do not completely capture numeration. Using MDS to scale down the pairwise (dis)similarities between number representations produces varying correspondences between LLMs and the logarithmically compressed MNL of humans, with GPT-2 showing the highest correlation (using digits as inputs). Our residual analysis exhibits high deviation from expected outputs for the numbers 2, 5, and 9, which we explain through patterns observed in previous linguistics studies. The behavioral benchmarking of the numeric magnitude representations of LLMs presented here helps us understand the cognitive plausibility of the representations the models learn. Our results show that LLM pre-training allows models to approximately learn human-like behaviors for two out of the three magnitude effects without the need to posit explicit neural circuitry. Future work on building pre-trained architectures to improve numerical cognition abilities should also be evaluated using these three effects.

Limitations
Limitations of our work are as follows: (1) We only study the three magnitude effects for the number word and digit denotations of the numbers 1 to 9. The effects for the number 0, numbers greater than 10, decimal numbers, negative numbers, etc. are beyond the scope of this study. Future work can design behavioral benchmarks for evaluating whether LLMs show these effects for these other number classes.
(2) The mapping of LLM behaviors to human behaviors and effects might vary for each effect. Thus, we might require a different linking hypothesis for each such effect. (3) We only use models built for English tasks and do not evaluate multilingual models. (4) We report and analyze aggregated scores across different dimensions. There can be some information loss in this aggregation.
(5) Our choice of models is limited by certain resource constraints. Future work can explore the use of other foundation or super-large models (1B+ parameters) and API-based models like GPT-3 and OPT. (6) The behavioral analysis of this study is one-way: we look for human performance characteristics and behaviors in LLMs.
Future research can utilize LLMs to discover new numerical effects and look for the corresponding performance characteristics in humans. This could spur new research in cognitive science. (7) Our results show that model outputs resemble the low-dimensional human MNL and that explicit neural circuitry need not be posited for number understanding. We do not suggest that models are actually human-like in how they process numbers.
Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision RNNs for language recognition. CoRR, abs/1805.04908.
For the models in Table 1, we show the three effects for the larger variants. The variants have the same architectures and training methodologies as their base variants but more parameters (roughly thrice the number of parameters). We use the averaged distance and size effect scores as inputs (column: Total Averages; Tables 2 and 4) and the averaged ratio effect scores (column: Total Averages; Table 4) as output. Importantly, the distance effect averages are statistically significant predictors of the ratio effect averages (see Table 23). These results provide a superficial view of the impact of the distance and size effects on the ratio effect scores because of the aggregation performed at different levels of the study.

Figure 1: The input types, LLMs, and effects in this study. The three effects are depicted in an abstract manner in sub-figures (a), (b), and (c).

Figure 2: Distance effect for the best-performing layer (9th layer) of the BART model.

Figure 3: Size effect for the best-performing layer for the BERT model (layer 11).
Figure 4: Ratio effect for the best-performing layer for the BART model (layer 3).

Figure 5: MDS visualization on averaged distances of the GPT-2 model for all number formats and layers.

Table 3 :
Distance Effect: Averaged (across layers) R² values of different LLMs on the three number formats when fitting a linear function. LC: Lowercase number words, MC: Mixed-case number words.

Table 4 :
Size Effect: Averaged (across inputs) R² values of different LLMs on different input layers when fitting a linear function. RoB: RoBERTa-base model, BERT: uncased variant.

Table 5 :
Size Effect: Averaged (across layers) R² values of different LLMs on the three number formats when fitting a linear function. LC: Lowercase number words, MC: Mixed-case number words.

Table 6 :
Ratio Effect: Averaged (across layers) R² values of different LLMs on different number formats when fitting a negative exponential function. LC: Lowercase number words, MC: Mixed-case number words.

Table 7 :
Ratio Effect: Averaged (across number formats) R² values of different LLMs on different input layers when fitting a negative exponential function.

Table 8 :
Averaged (across layers) correlations when comparing MDS values with log10(1) to log10(9) for different LLMs. LC: Lowercase number words, MC: Mixed-case number words.

Table 9 :
Averaged (across inputs) correlations of different LLMs on different model layers when comparing MDS values with log10(1) to log10(9). RoB: RoBERTa-base model, BERT: uncased variant.

Table 11 :
Averaged distance effect, size effect, ratio effect, and the MDS correlation values for the different input types of the models.

Table 12 :
Residual analysis on MDS outputs in 1 dimension on the large variants of the models. RoB: RoBERTa-base model, BERT: uncased variant.

Table 17 :
Size Effect: Averaged (across inputs) R² values of the larger variants of different LLMs for different layers when fitting a linear function. RoB: RoBERTa-base model, BERT: uncased variant.

Table 19 :
Ratio Effect: Averaged (across inputs) R² values of the larger variants of different LLMs for different layers when fitting a negative exponential function. RoB: RoBERTa-base model, BERT: uncased variant.

Table 20 :
Averaged (across layers) correlation values when comparing MDS values with log10(1) to log10(9) for large variants of different LLMs. LC: Lowercase number words, MC: Mixed-case number words.

Table 21 :
Averaged (across inputs) correlation values of the large variants of different LLMs on different model layers when comparing MDS values with log10(1) to log10(9). RoB: RoBERTa-base model, BERT: uncased variant.

Table 22 :
Behavioral differences between the cased and uncased variants of the BERT architecture.LC: Lowercase number words, MC: Mixed-case number words.

Table 23 :
Impact of layer-wise trends of the distance and size effects on the ratio effect; indicates statistical significance with p-value less than 0.01, ⊕ indicates statistical significance with p-value less than 0.00001.