Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color

Pretrained language models have been shown to encode relational information, such as the relations between entities or concepts in knowledge bases, e.g., (Paris, Capital, France). However, simple relations of this type can often be recovered heuristically, and the extent to which models implicitly reflect topological structure that is grounded in the world, such as perceptual structure, is unknown. To explore this question, we conduct a thorough case study on color. Namely, we employ a dataset of monolexemic color terms and color chips represented in CIELAB, a color space with a perceptually meaningful distance metric. Using two methods of evaluating the structural alignment of colors in this space with text-derived color term representations, we find significant correspondence. Analyzing the differences in alignment across the color spectrum, we find that warmer colors are, on average, better aligned to the perceptual color space than cooler ones, suggesting an intriguing connection to findings from recent work on efficient communication in color naming. Further analysis suggests that differences in alignment are, in part, mediated by collocationality and differences in syntactic usage, posing questions as to the relationship between color perception and usage and context.


Introduction
Without grounding or interaction with the world, language models (LMs) learn representations that encode various aspects of formal linguistic structure (e.g., morphosyntax (Tenney et al., 2019)) and semantic information (e.g., lexical similarity (Reif et al., 2019a)). Beyond this, it has been suggested that text-only training data is enough for LMs to also acquire factual and relational information about the world (Davison et al., 2019; Petroni et al., 2019). This includes, for instance, some features of concrete and abstract concepts, such as objects' attributes and affordances (Forbes et al., 2019b; Weir et al., 2020). Furthermore, the representational geometry of LMs has been found to naturally reflect human lexical similarity and relatedness judgements, as well as analogy relationships (Chronis and Erk, 2020). However, the extent to which these models reflect the structures that exist in humans' perceptual world, such as the topology of visual perception (Chen, 1982), the structure of the color spectrum (Ennis and Zaidi, 2019; Provenzi, 2020), or of odour spaces (Rossiter, 1996; Chastrette, 1997), is not well understood.
* For correspondence: {abdou,soegaard}@di.ku.dk
If LMs are indeed able to capture such topologies (in some domains, at least), it would mean that these structures are a) somehow reflected in language and, thereby, encoded in the textual training data on which models are trained, and b) learnable using models' current training objectives and architectural inductive biases. To the extent they are not, the question becomes whether the information is not there in the data, or whether model and training objective limitations are to blame. Certainly, this latter point relates to an ongoing debate regarding what exactly language models can be expected to learn from ungrounded form alone (Bender and Koller, 2020; Bisk et al., 2020; Merrill et al., 2021). While there have been many interesting theoretical debates around this topic, few studies have tried to address this question empirically.
arXiv:2109.06129v2 [cs.CV] 14 Sep 2021
In this paper, we conduct a case study on color. Indeed, color perception in humans and its relation to speakers' use of color terms has long been the subject of studies in cognitive science (Kay and McDaniel, 1978; Berlin and Kay, 1991; Regier et al., 2007; Kay et al., 2009). To this end, spaces have been defined in which Euclidean distances between related colors are correlated with reported perceptual differences.1 In addition, the semantics of color terms have long been understood to hold particular linguistic significance, as they are theorised to be subject to universal constraints that arise directly from the neurophysiological mechanisms and properties underlying visual perception and cognition (Kay and McDaniel, 1978; Berlin and Kay, 1991; Kay et al., 1991).2 Due to these factors, color offers a useful test-bed for investigating whether or not structural information about the topology of the perceptual world might be encoded in linguistic representations.
To explore this in detail, we employ a dataset of English color terms and their corresponding color chips,3 the latter of which are represented in CIELAB, a perceptually uniform color space. In addition to the color chip CIELAB coordinates, we extract linguistic representations for the corresponding color terms. With these two representations in mind (see Figure 1 for a demonstrative plot from our experiments), we employ two methods of measuring structural correspondence, with which we evaluate the alignment between the two spaces. Figure 2 shows an illustration of the experimental setup. We find that the structures of various language model representations show alignment with the structure of the CIELAB space, demonstrating that some approximation of perceptual color space topology can indeed be learned from text alone.
1 The differences between color stimuli which are perceived by human observers.
2 These theories have been contested by work arguing for linguistic relativism (cf. the Sapir-Whorf Hypothesis), which emphasizes the arbitrariness of language and the relativity of semantic structures and minimizes the role of universals. Such critiques have, however, been accommodated in the Berlin & Kay paradigm (Berlin and Kay, 1991), the basic assumptions of which, such as the existence of at least some perceptually-determined universal constraints on color naming, remain widely accepted.
3 Each chip is a unique color sample from the Munsell chart, which is made up of 330 such samples that cover the space of colors perceived by humans. See §2.
We also show that part of this distributional signal is learnable by simple models, e.g., models based on pointwise mutual information (PMI) statistics, although large-scale language model pretraining (e.g., BERT) encodes the topology markedly better.
Analysis shows that larger language models align better than smaller ones and that much of the variance in CIELAB space can be explained by low-dimensional subspaces of LM-induced color term representations. To better understand the results, we also analyse the differences in alignment across the color spectrum, observing that warm colors are generally better aligned than cool ones. Further investigation reveals a connection to findings reported in work on communication efficiency in color naming, which posits that warmer colors are communicated more efficiently. Finally, we investigate various corpus statistics which could influence alignment, finding that a measure of color term collocationality based on PMI statistics corresponds to lower alignment, while the entropy of a color term's dependency relation distribution (i.e., terms occurring as adjectival modifiers, nominal subjects, etc.) and how often it occurs as an adjectival modifier correspond to a stronger one.

Methodology
Color data We employ the Color Lexicon of American English, which provides extensive data on color naming. The lexicon consists of 51 monolexemic color name judgements for each of the 330 Munsell Chart color chips4 (Lindsey and Brown, 2014). The color terms are solicited through a free-naming task, resulting in 122 terms.
Perceptual color space Following previous work (Regier et al., 2007; Zaslavsky et al., 2018; Chaabouni et al., 2021), we map colors to their corresponding points in the 3D CIELAB space, where the first dimension L expresses lightness, the second A expresses position between red and green, and the third B expresses the position between blue and yellow. Distances between colors in the space correspond to their perceptual difference.
Language models Our analysis is conducted on three widely used language models (LMs): BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), both of which employ a masked language modelling objective, and ELECTRA (Clark et al., 2020), which is trained instead with a discriminative token replacement detection objective.5
Baselines In addition to the aforementioned language models, we consider two different baselines:
• PMI statistics, which are computed6 for the color terms in Common Crawl, using window sizes of 1 (pmi-1), 2 (pmi-2), and 3 (pmi-3). The result is a vocabulary-length vector quantifying the likelihood of co-occurrence of the color term with every other vocabulary item within that window.
• Word-type FastText embeddings trained on Common Crawl (Bojanowski et al., 2017).
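As a rough illustration of the PMI baseline, the following Python sketch builds a sparse PMI vector for one target term from windowed co-occurrence counts. The toy corpus, the function name `pmi_vector`, the base-2 logarithm, and the estimation of marginals from the co-occurrence table itself are all assumptions made for illustration; the text above only specifies the window sizes.

```python
import math
from collections import Counter

def pmi_vector(sentences, target, window):
    """Map each context word seen near `target` to PMI(target, context).

    Context words never observed within the window are omitted, since their
    PMI is undefined (log of zero) without smoothing.
    """
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1

    total = sum(pair_counts.values())
    word_totals, context_totals = Counter(), Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n

    vector = {}
    for (word, context), n in pair_counts.items():
        if word == target:
            # PMI = log2( P(w, c) / (P(w) * P(c)) ), all estimated from counts.
            vector[context] = math.log2(
                n * total / (word_totals[word] * context_totals[context])
            )
    return vector

# Hypothetical toy corpus; the paper uses Common Crawl.
corpus = [
    ["the", "red", "apple"],
    ["the", "red", "car"],
    ["the", "blue", "sky"],
]
red_pmi_1 = pmi_vector(corpus, "red", window=1)  # analogous to pmi-1
```

Calling the function with `window=2` or `window=3` would give the analogues of pmi-2 and pmi-3 under the same assumptions.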

Controlled context
To control for the effect of variation in the sentence contexts used to construct color term representations, we employ a templated approach to generate a set of identical contexts for all color terms. When generating controlled contexts, we create three frames in which the terms can appear:
• COPULA: the <obj> is <col>
• POSSESSION: i have a <col> <obj>
• SPATIAL: the <col> <obj> is there
We use these frames in order to limit the contextual variation across colors (<col>) and to isolate their representations amidst as little semantic interference as possible, all while retaining a naturalistic quality to the input. We also aggregate over numerous object nouns (<obj>), which the color terms are used to describe. We select objects from the McRae et al. (2005) data which are labelled there as plausibly occurring in many colors and which are stratified across 13 category sets, e.g., fan ∈ APPLIANCES, skirt ∈ CLOTHING, etc. Collapsing over categories, we generate sentences combinatorially across frames, objects and color terms, resulting in 3 × 122 × 18 = 6588 sentences, 366 per term.
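The combinatorial generation over frames, objects and color terms can be sketched as follows. The three frame strings come from the text; the short color and object lists are hypothetical stand-ins for the 18 color terms and 122 object nouns, and `build_sentences` is an illustrative name rather than the authors' code.

```python
from itertools import product

# The three controlled-context frames described in the text.
FRAMES = {
    "COPULA": "the {obj} is {col}",
    "POSSESSION": "i have a {col} {obj}",
    "SPATIAL": "the {col} {obj} is there",
}

def build_sentences(color_terms, objects):
    """Return one (frame, color, object, sentence) tuple per combination."""
    rows = []
    for (frame, template), col, obj in product(FRAMES.items(), color_terms, objects):
        rows.append((frame, col, obj, template.format(obj=obj, col=col)))
    return rows

# Hypothetical subsets; "car" is a made-up object used only for illustration.
colors = ["red", "blue"]
objects = ["fan", "skirt", "car"]
rows = build_sentences(colors, objects)

# Each term appears in len(FRAMES) * len(objects) sentences.
per_term = len(rows) // len(colors)
```

With 3 frames, 122 objects and 18 terms, the same function would yield the 6588 sentences (366 per term) reported above.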

Evaluation
We employ two complementary evaluation methods to gauge the correspondence of the color term text-derived representations to the perceptual color space. The first, Representation Similarity Analysis (RSA), is non-parametric and uses pairwise comparisons of stimuli to provide a measure of the global topological alignment between two spaces. The second employs a learned linear mapping, evaluating the extent to which two spaces can be aligned via transformation (rotation, scaling, etc.).
RSA (Kriegeskorte et al., 2008) is a method of relating different representational modalities, which was first employed in neuroscientific studies. RSA abstracts away from activity patterns themselves (e.g., neuron values in representational vectors) and instead computes representational (dis)similarity matrices (RSMs), which characterize the information carried by a given representation method through global (dis)similarity structure. Kendall's rank correlation coefficient (τ) is computed between RSMs derived from the two spaces, providing a summary statistic indicative of the overall representational alignment between them. RSA is non-parametric and therefore circumvents many of the various methodological weaknesses associated with the probing paradigm (Belinkov, 2021).
For each color term, we compute a centroid in the CIELAB space following the approach described in Lindsey and Brown (2014). Each centroid is defined as the average CIELAB coordinate of the samples (i.e., color chips) that were named with the corresponding term (across the 51 subjects). This results in N parallel points in the color term embedding and perceptual color spaces, where N is the number of color terms considered. For our analysis, we exclude color terms used less frequently than a cutoff f = 100 in the color lexicon, leaving us with the 18 most commonly used color terms.7 We then separately construct an N × N RSM for each of the LM spaces and for CIELAB. Each cell in the RSM corresponds to the similarity between the activity patterns associated with pairs of experimental conditions n_i, n_j ∈ N.
For the color term embedding space, we employ Pearson's correlation coefficient (r) as a similarity measure between each pair of embeddings n_i, n_j ∈ N. For the CIELAB space, we elect to use the following measure, following Regier et al. (2007):
\mathrm{sim}(n_i, n_j) = \exp\left(-c \times [\mathrm{dist}(n_i, n_j)]^2\right)
where c is a scaling factor (set to 0.001 in all experiments reported here) and dist(n_i, n_j) is the CIELAB distance (ΔE*_CMC)8 between chips n_i and n_j. This similarity measure is derived from the psychological literature on categorization and is meant to model the assumption that beyond a certain distance colors appear entirely different, so that increasing the distance has no further effect on dissimilarity. Finally, we report the mean Kendall's τ between the color term embedding and color space RSMs. We also report τ per color term (i.e., per row in the RSM), which corresponds to how well-aligned each individual color term is.
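The RSA procedure can be sketched in plain Python as below. Several choices here are assumptions rather than details confirmed by the text: plain Euclidean distance stands in for the ΔE CMC distance, the CIELAB similarity is taken to be the Gaussian form exp(−c · dist²) with c = 0.001, Kendall's τ is computed in its tie-corrected τ-b variant, and the overall score is taken as the mean of per-row τ values with the diagonal excluded. All function names and the toy data are illustrative.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length, non-constant vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def cielab_similarity(p, q, c=0.001):
    """exp(-c * dist^2); Euclidean distance is a stand-in for Delta E CMC."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-c * dist_sq)

def rsm(points, sim):
    """N x N representational similarity matrix under similarity `sim`."""
    return [[sim(p, q) for q in points] for p in points]

def kendall_tau_b(x, y):
    """Kendall's tau-b via an O(n^2) pair count (fine for N = 18)."""
    n = len(x)
    conc = disc = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            ties_x += dx == 0
            ties_y += dy == 0
            if dx * dy > 0:
                conc += 1
            elif dx * dy < 0:
                disc += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (conc - disc) / denom if denom else 0.0

def rsa_per_term(rsm_a, rsm_b):
    """Tau per color term: compare matching rows, skipping the diagonal."""
    scores = []
    for i, (row_a, row_b) in enumerate(zip(rsm_a, rsm_b)):
        a = [v for j, v in enumerate(row_a) if j != i]
        b = [v for j, v in enumerate(row_b) if j != i]
        scores.append(kendall_tau_b(a, b))
    return scores

def rsa_score(rsm_a, rsm_b):
    per_term = rsa_per_term(rsm_a, rsm_b)
    return sum(per_term) / len(per_term)

# Hypothetical toy data: four CIELAB centroids and four term embeddings.
lab = [(50, 0, 0), (50, 10, 0), (50, 0, 30), (10, 0, 0)]
emb = [[0.9, 0.1, 0.3, 0.2], [0.8, 0.2, 0.4, 0.1],
       [0.1, 0.9, 0.5, 0.3], [0.2, 0.3, 0.1, 0.9]]
score = rsa_score(rsm(emb, pearson), rsm(lab, cielab_similarity))
```

The shuffled-centroid control mentioned for RSA would correspond to permuting `lab` before building its RSM and recomputing the score.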
Linear mapping We train regularised linear regression models to map from the color term embedding space X to the CIELAB space Y, minimizing
\lVert XW - Y \rVert_2^2 + \alpha \lVert W \rVert_1
where W is a linear map and α is the lasso regularization hyperparameter. We vary α across a wide range of settings to examine the effect of probe complexity, which we measure using the nuclear norm of the linear projection matrix W ([...] et al., 2020). The fitness of the regressors, evaluated using n-fold cross-validation (n = 6), indicates the alignability of the two spaces, given a linear transformation. Centroids corresponding to each Munsell color chip are computed in the color term embedding space via the weighted mean of the embeddings of the 51 terms used to label it. As in the RSA experiments, terms occurring less frequently than the cutoff (f = 100) are excluded. For evaluation, we compute the average (across splits and datapoints) proportion of explained variance as well as the ranking of a predicted color term embedding according to the Pearson distance (1 − r) to gold.
Control task Following Hewitt and Liang (2019), we construct a random control task for the linear mapping experiments, wherein we randomly swap each color chip's CIELAB code for another. This is meant to break the mapping between the color chips and their corresponding terms. Control task results are reported as the mean of 10 different random re-mappings. We report probe selectivity, which is defined as the difference between the proportion of explained variance in the standard experimental condition and in the control task (Hewitt and Liang, 2019). We run a similar control for the RSA experiments, where the CIELAB space centroids are randomly shuffled.
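To make the lasso probe and the selectivity measure concrete, here is a minimal pure-Python sketch. It is a simplification under stated assumptions: a single output dimension (one would fit L, A and B separately), no intercept, in-sample evaluation instead of 6-fold cross-validation, and the common (1/2n)-scaled form of the lasso objective, so its α is not directly comparable to the one in the text. The names `fit_lasso`, `probe_score`, `make_control` and `selectivity` are hypothetical.

```python
import random

def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso(X, y, alpha, n_sweeps=200):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            rho = 0.0
            for i in range(n):
                # Prediction for row i that ignores feature j.
                partial = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z else 0.0
    return w

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def explained_variance(y_true, y_pred):
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - variance(residuals) / variance(y_true)

def probe_score(X, y, alpha):
    """Fit the probe and return its (in-sample) explained variance."""
    w = fit_lasso(X, y, alpha)
    preds = [sum(x * wk for x, wk in zip(row, w)) for row in X]
    return explained_variance(y, preds)

def make_control(y, seed):
    """Control task: randomly reassign targets to break the true mapping."""
    shuffled = list(y)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def selectivity(X, y, y_control, alpha):
    """Explained variance on the real task minus that on the control task."""
    return probe_score(X, y, alpha) - probe_score(X, y_control, alpha)

# Hypothetical toy data: two embedding features, one target dimension.
X = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1]]
y = [2, 4, 6, 8, 10, 12]
controls = [make_control(y, seed) for seed in range(10)]
mean_selectivity = sum(selectivity(X, y, c, alpha=0.0) for c in controls) / 10
```

Averaging over ten shuffles mirrors the ten random re-mappings described above; sweeping `alpha` and recording the norm of the fitted weights would give the complexity axis.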

Results
Table 1 shows the max, mean, and standard deviation (across layers) of alignment scores for each of the LMs, per alignment method and setting. For RSA, we observe significant correlations across all configurations: most LM layers show a topological alignment with color space. Notably, this is also true for the static embeddings and for one of the PMI baselines (Table 2). Although some variance is observed,9 the presence of significant correlations is telling, given the small sample size (18). Furthermore, randomly permuting the color space centroids leads to RSA correlations that are non-significant for all setups (p > 0.05), which lends further credence to models' alignment with CIELAB structure.
Figure 3 shows the breakdown of correlations per color term for the three LMs under CC, as well as for fastText. We find that this ranking of color terms is largely stable across models and layers. Full RSMs for all models and CIELAB are in appendix C. The RSMs show evidence of the higher correlations for colors like violet, orange, and purple being driven by general clusterings of similarity/dissimilarity. For instance, for both the CIELAB and CC BERT RSMs, violet's top nearest neighbors include purple, lavender, pink, and orange, and its furthest neighbors include aqua, olive, black, and gray. Correlations do not, however, appear to be driven by consistently aligned partial orderings within the clusters. In addition, we compute RSA correlations between the different models. Results show that NC embeddings have low alignment to all others (details in appendix B).
For the linear mapping experiments, we observe the highest selectivity scores for CC (Table 1, right) compared to NC and RC (Table 1, left, middle) and baselines (Table 2). This validates our intuition that controlling for variation in sentence context would reveal increased alignment to color space.
Furthermore, we observe that, over the full range of probe complexities for the experimental condition and the control task (described in §3), all models demonstrate high selectivity (see appendix G for full results). It is, therefore, safe to attribute the fitness of the probes to information encoded in the color term representations, rather than to memorization. In terms of individual colors, Figure 4a depicts the ranking of predicted CIELAB codes per Munsell color chip for BERT (CC). We find that these results are largely stable across models and layers (see appendix F for the full set of results and for a reference chart). Also, we observe that clusterings of chips with certain modal color terms (green, blue) show worse rankings than the rest.

Analysis and Discussion
Having demonstrated the existence of models' alignment to CIELAB across various configurations, we now present an analysis and discussion of these results.
Dimensionality of color subspace Previous work has shown that linguistic information such as part-of-speech category, dependency relation type, and word sense, is expressed in low-dimensional subspaces of language model representations (Reif et al., 2019b; Durrani et al., 2020; Hernandez and Andreas, 2021). We investigate the dimensionality of the subspace required to predict the CIELAB chip codes from the term embeddings, following the methodology of Durrani et al. (2020). Averaging over the three predicted CIELAB dimensions, we rank the linear mapping coefficients (from the experiments described in §2), sorting the weights by their absolute values in descending order. Results (appendix H) show that across models and layers, ∼0.4 of the variance in the CIELAB chip codes can be explained by assigning 95% of the weights to ∼10 dimensions. 30-40 dimensions are sufficient to explain ∼0.7 of the variance, nearly the proportion of variance explained by the full representations (Table 1).
Communication efficiency is measured through surprisal, S, which in this setting corresponds to the average number of guesses an optimal listener takes to arrive at the correct color chip. We calculate S(c) for each chip in the color lexicon. Surprisal is defined as
S(c) = \sum_{w} P(w \mid c) \log \frac{1}{P(c \mid w)}
where P(w|c) is the probability that a color c gets labeled as w and P(c|w) is computed using Bayes' Theorem. Here, P(w) represents how often a particular word gets used across the color space (and participants), and P(c) is a uniform prior. Figure 4b shows surprisal per chip. High surprisal chips correspond to a lower color naming consensus among speakers, meaning that a more variable range of terms is used for these (color) contexts. We hypothesize that this could be reflected in the representations of color terms corresponding to high surprisal chips. To test this, we compute Spearman's correlation (ρ) between a chip's regression score (predicted color chip code ranking) and its surprisal. We find a significant Spearman's rank correlation between lower ranking and higher surprisal for all LMs under all configurations (0.12 ≤ ρ ≤ 0.17, p < 0.05).
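The surprisal computation can be sketched as follows, assuming it takes the form S(c) = Σ_w P(w|c) log(1/P(c|w)) used in work on communication efficiency in color naming, with P(c|w) obtained from Bayes' theorem and a uniform prior P(c). The base-2 logarithm, the input layout (a dictionary of naming counts per chip) and the function name are assumptions made for illustration.

```python
import math

def surprisal_per_chip(naming_counts):
    """naming_counts: {chip: {term: count}} -> {chip: S(c) in bits}."""
    chips = list(naming_counts)
    p_c = 1.0 / len(chips)  # uniform prior over chips

    # P(w|c): relative frequency of each term among the labels for chip c.
    p_w_given_c = {}
    for c in chips:
        total = sum(naming_counts[c].values())
        p_w_given_c[c] = {
            w: n / total for w, n in naming_counts[c].items() if n > 0
        }

    # P(w) = sum_c P(w|c) * P(c)
    p_w = {}
    for c in chips:
        for w, p in p_w_given_c[c].items():
            p_w[w] = p_w.get(w, 0.0) + p * p_c

    surprisal = {}
    for c in chips:
        s = 0.0
        for w, p in p_w_given_c[c].items():
            p_c_given_w = p * p_c / p_w[w]  # Bayes' theorem
            s += p * math.log2(1.0 / p_c_given_w)
        surprisal[c] = s
    return surprisal

# Hypothetical naming data: chip "A" is always called red, while chip "B"
# splits evenly between red and blue (lower naming consensus).
counts = {"A": {"red": 4}, "B": {"red": 2, "blue": 2}}
S = surprisal_per_chip(counts)
```

In this toy case the chip with the lower naming consensus receives the higher surprisal, which is the pattern the paragraph above relates to worse regression rankings.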

What factors predict color space alignment?
Given that LMs are trained exclusively on text corpora, we hypothesize that alignment between their embeddings and CIELAB is influenced by corpus usage statistics. To determine which factors could predict alignment score, we extract color term log frequency, part-of-speech tag (POS), dependency relation (DREL), and dependency tree head (HEAD) statistics for all color terms from a dependency-parsed (Straka et al., 2016) Common Crawl corpus. In addition to this, we compute, per color term, the entropy of its normalised PMI distribution (pmi-col, see §2) as a measure of collocation.11 We then fit a Linear Mixed Effects Model (Gałecki and Burzykowski, 2013) to the features listed above, with RSA score (Table 1) as the response variable, and model type as a random effect.
We follow a multi-level step-wise model building sequence, where a baseline model is first fit with color term log frequency as a single fixed effect. A model which includes pmi-col as an additional fixed effect is then fit, and these two terms are included as control predictors in all later models. Following this, we compute POS, DREL, and HEAD lemma distribution entropies per color term (pos-ent, deprel-ent, head-ent). Higher entropies indicate that the term is employed in more diverse contexts with respect to those categories. Following entropy computation, we separately fit models including each of the three entropy statistic features. Finally, we calculate the proportion of: POS tags that are adjectives, adj-prop; DRELs that are adjectival modifiers, amod-prop; and those that are copulas, cop-prop. The first two evaluate the effect of a color term occurring more or less often as an adjectival modifier, while the latter tests the hypothesis that assertions such as The banana is yellow could provide indirect grounding (Merrill et al., 2021), thereby leading to higher alignment. Including the entropy term which led to the best fit (deprel-ent) in the previous level, models are fit including terms for each of the proportion statistics. Model comparison is carried out by computing the log likelihood ratio between models that differ in a single term. See appendix J for model details.
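The entropy and proportion features can be illustrated with a short sketch. It assumes each feature is computed from the list of labels (e.g., dependency relations) observed for one color term in a parsed corpus; the base-2 logarithm, the function names and the toy label list are assumptions made for illustration.

```python
import math
from collections import Counter

def distribution_entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def label_proportion(labels, target):
    """Fraction of observations carrying the label `target`."""
    return sum(1 for label in labels if label == target) / len(labels)

# Hypothetical dependency relations observed for one color term.
deprels = ["amod", "amod", "nsubj", "nsubj"]
deprel_ent = distribution_entropy(deprels)    # analogous to deprel-ent
amod_prop = label_proportion(deprels, "amod")  # analogous to amod-prop
```

The same two functions would apply unchanged to POS tags (pos-ent, adj-prop) and head lemmas (head-ent).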

Results show that:
• pmi-col significantly improves fit above log frequency and has a negative coefficient, meaning that terms that occur in more fixed collocations are less aligned to the perceptual space. Intuitively, this makes sense as the color terms in many collocations such as Red Army or Black Death are employed in contexts which are largely metaphorical rather than attributive or descriptive.
• deprel-ent and head-ent (but not pos-ent) lead to a significantly improved fit compared to the control predictors; we observe positive coefficients for both, indicating RSA score is higher for terms that occur in more varied syntactic dependency relations and modify a more diverse set of syntactic heads. This suggests that occurring in a more diverse set of contexts might be beneficial for robust representation learning, in correspondence with the idea of sample diversity in the active learning literature (Brinker, 2003; Yang et al., 2015). pos-ent's lack of significance, on the other hand, indicates that the degree of specification offered by the POS tagset might be too coarse to meaningfully differentiate between color terms, e.g., nouns can occur in a variety of DRELs such as subjects, objects, oblique modifiers (per the Universal Dependencies (Nivre et al., 2020)).
• out of the proportion statistics, only the amod-prop term improves fit; it has a positive coefficient, thus color terms occurring more frequently as adjectival modifiers show higher scores. adj-prop is not significant, providing further evidence for the POS tagset's level of granularity being too coarse. Finally, as cop-prop is not significant, it appears that occurring more frequently in assertion-like copula constructions does not confer an advantage in terms of alignment to perceptual structure.
Vision-and-Language models In a preliminary set of experiments, we evaluated multi-modal Vision-and-Language models (VisualBERT (Li et al., 2019) and VideoBERT (Sun et al., 2019)), finding no major differences in results from the text-only models presented in this study.

Related Work
Distributional word representations have long been theorized to capture various types of information about the world (Schütze, 1992). Early work in this regard employed semantic similarity and relatedness datasets to measure alignment to human judgements (Agirre et al., 2009; Bruni et al., 2012; Hill et al., 2015). Rubinstein et al. (2015), however, question whether the distributional hypothesis is equally applicable to all types of semantic information, finding that taxonomic properties (such as animacy) are better modelled than attributive ones (color, size, etc.). To a similar end, Lucy and Gauthier (2017) analyze how well distributional representations encode various aspects of grounded meaning. They investigate whether language models would "be worse off for not having physically bumped into walls before they hold discussions on wall-collisions?", finding that perceptual features are poorly modelled compared to encyclopedic and taxonomic ones. More recently, several studies have asked related questions in the context of language models. For example, Davison et al. (2019) and Petroni et al. (2019) mine LMs for factual and commonsense knowledge by converting knowledge base triplets into cloze statements that are used to query the models. In a similar vein, Forbes et al. (2019a) investigate LM representations' encoding of object properties (e.g., oranges are round) and affordances (e.g., oranges can be eaten), as well as the interplay between the two. Weir et al. (2020) demonstrate that LMs can capture stereotypic tacit assumptions about generic concepts, showing that they are adept at retrieving concepts given their associated properties (e.g., bear given A ___ has fur, is big, and has claws.). Similar to other work, they find that LMs better model encyclopedic and functional properties than they do perceptual ones.
In an investigation of whether or not LMs are able to overcome reporting bias, Shwartz and Choi (2020) extract all sentences in Wikipedia where one of 11 color terms modifies a noun and test how well the color term is predicted when it is masked. They find that LMs are able to model this relationship between concepts and associated colors to a certain extent, but are prone to over-generalization. Finally, Ilharco et al. (2020) train a probe to map LM representations of textual captions to paired visual representations of image patches, in order to evaluate how useful the former are for discerning between different visual representations. They find that many recent LMs yield representations that are effective at retrieving semantically-aligned image patches, but still far underperform humans.

Outlook
It is commonly held that the learning of phenomena which rely on sensory perception is only possible through direct experience. Indeed, the view that people born blind could not be expected to acquire coherent knowledge about colors has been prevalent since at least the empiricist philosophers (Locke, 1847; Hume, 1938) and still holds currency (Jackson, 1982). Nevertheless, recent research highlighting the contribution of language and of semantic associations between concepts towards learning has demonstrated that the congenitally blind do in fact show a striking understanding of both color similarity (Saysani et al., 2018) and object colors (Kim et al., 2020).
This paper investigated whether representations of color terms that are derived from text only express a degree of isomorphism to the structure of humans' perceptual color space.12 Results from our experiments evidenced that such a topological correspondence exists. Notably, color term representations based on simple co-occurrence statistics already demonstrated correspondence; those extracted from language models aligned more closely. We observed that warm colors, on average, show more alignment than cooler ones, linking to recent findings on communication efficiency in color naming (Gibson et al., 2017).
Further analysis based on surprisal (an information-theoretic measure used to evaluate how efficiently a color is communicated between a speaker and a listener) revealed a correlation between lower topological alignment and higher color chip surprisal, suggesting that the kinds of contexts a color occurs in play a role in determining alignment. Exploring this, we tested a set of color term corpus-derived statistics for how well they predict alignment, finding that a measure of a color term's collocationality corresponds to lower alignment, while the entropy of its dependency relation distribution and its occurring more frequently as an adjectival modifier correspond to closer alignment.
Our results and analyses present empirical evidence of topological alignment between text-based color term representations and perceptual color spaces. With respect to the debate started by Bender and Koller (2020), we hope that this work offers a modest step towards furthering our understanding of the kinds of "meaning" we expect language models to acquire, with and without grounded or embodied learning approaches, and that it will provide motivation for further work in this direction.

B RSA between models
Figure 5 shows the result of representation similarity analysis between the representations derived from all models (and configurations) as well as CIELAB, showing Kendall's correlation coefficient between flattened RSMs.

E Corpus statistics
Figures 12 and 13 show log frequency and entropy of distributions over part-of-speech categories, dependency relations, and lemmas of dependency tree heads of color terms in common crawl.

F Linear mapping results by Munsell color chip
Figure 14 shows linear mapping results broken down by Munsell chip for all models and configurations.

G Linear mapping control task and probe complexity
Figure 15 shows the full results over a range of probe complexities for the standard experimental condition as well as the random control task.

H Dimensionality of color subspace
Figure 16 shows the proportion of explained variance with respect to the number of dimensions which are assigned 95% of the linear regression coefficient weights.
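One way to read the selection step is sketched below: rank dimensions by the absolute value of their regression weight and count how many are needed to hold a given share of the total weight mass. This is an assumed reading for illustration only; it takes a single weight per dimension (already averaged over the three CIELAB outputs) and leaves out the subsequent step of measuring explained variance with the selected dimensions. The function name is hypothetical.

```python
def dims_for_weight_mass(weights, mass=0.95):
    """Smallest number of top-|w| dimensions holding `mass` of total |w|."""
    ranked = sorted((abs(w) for w in weights), reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0
    cumulative = 0.0
    for k, w in enumerate(ranked, start=1):
        cumulative += w
        if cumulative >= mass * total:
            return k
    return len(ranked)

# Hypothetical probe weights for a five-dimensional embedding.
weights = [6.0, -3.0, 0.8, 0.1, 0.1]
n_dims = dims_for_weight_mass(weights, mass=0.95)
```

Here the two largest weights hold 90% of the mass, so a third dimension is needed to pass the 95% threshold.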

I Effect of model size
Our model size experiments are run using four BERT models of different sizes: BERT-mini (4 layers, hidden size: 256), BERT-small (4 layers, hidden size: 512), BERT-medium (8 layers, hidden size: 512), and BERT-base (12 layers, hidden size: 768). Further model specification and training details for the first three can be found in Turc et al. (2019) and for the last in Devlin et al. (2019).

J Linear Mixed Effects Model
To fit Linear Mixed Effects Models, we use the lme4 package. With model type (BERT-CC, RoBERTa-NC, etc.) as a random effect, we follow a step-wise model construction sequence which proceeds along four levels of nesting: (i) in the first level, color log frequency is the only fixed effect; (ii) in the second, pmi-col is added to that; (iii) in the third, each of pos-ent, deprel-ent, head-ent is added separately to a model with log frequency and pmi-col; (iv) in the fourth, the term that leads to the best fit from the previous level (deprel-ent) is included, and then each of the proportion terms adj-prop, amod-prop, cop-prop is added. The reported regression coefficients are extracted from the minimal model containing each term.

Figure 1: Right: Color orientation in 3D CIELAB space. Left: linear mapping from BERT (CC, see §2) color term embeddings to the CIELAB space.

Figure 2: Our experimental setup. In the center is a Munsell color chart. Each chip in the chart is represented in the CIELAB space (right) and has 51 color term annotations. Color term embeddings are extracted through various methods. In the Representation Similarity Analysis experiments, a corresponding color chip centroid is computed in the CIELAB space. In the Linear Mapping experiments, a color term embedding centroid is computed per chip.

Figure 3: RSA results (Kendall's τ) broken down by color term for each of the LMs under the CC configuration and for the fastText baseline.
(a) Each circle on the chart represents the ranking of the predicted color chip when ranked according to Pearson distance from gold (larger circle ≅ higher/better ranking).
(b) Each circle on the chart represents a color chip's surprisal score (larger circle ≅ higher score).

Figure 4: (a) shows linear mapping results for BERT, under the CC configuration, broken down by Munsell color chip; (b) shows surprisal per chip. Circle colors reflect the modal color term assigned to the chips.

Figures 6 to 9 show the representation similarity matrices employed for the RSA analyses, for the layer with the highest RSA score from each of the controlled-context (CC) models.

Figures 10 and 11 show Linear Mapping and RSA results broken down by color temperature. The color space is split according to temperature measured according to the Hue dimension in the Hue-Value-Saturation space.13

Figure 5: Result of representation similarity analysis between all models (and configurations), showing Kendall's correlation coefficient between flattened RSMs. Results are shown for layers which are maximally correlated with CIELAB, per model. -rc indicates random-context, -cc indicates controlled-context, and -nc indicates non-context.

Figure 11: RSA results (Kendall's τ) broken down by color temperature for each of the baselines and the LMs.

Figure 12: Log frequency of color terms in common crawl.

Figure 13: Entropy of distributions over part-of-speech categories, dependency relations, and lemmas of dependency tree heads of color terms in common crawl.

Figure 14: Linear mapping results for each of the baselines and language models, under all extraction configurations, broken down by Munsell color chip. Each circle on the chart represents the ranking of the predicted color chip when ranked according to Pearson distance (1 − Pearson's r) from gold: the larger the circle, the higher (better) the ranking. Circle colors reflect the modal color term assigned to the chips in the lexicon. A reference plot showing the modal color of all chips is also included.

Figure 15: Explained variance for the linear probes trained on the normal experimental condition (blue) and the control task (red) where color terms are randomly permuted. The means are indicated by the lines and standard deviation across layers is indicated by the bands.

Figure 16: The y-axis shows explained variance for the linear probes. The means are indicated by the lines and standard deviation across layers is indicated by the bands. The x-axis shows the number of regression matrix coefficients assigned 95% of the weight.

Table 2: Baseline results. RSA results show Kendall's τ; results with * are significantly non-zero (p < 0.05). Linear mapping results show selectivity.
models. Results show that NC embeddings have low alignment to all others (details in appendix B). For the linear mapping experiments, we observe the highest selectivity scores for CC (Table 1, right) compared to NC and RC (Table 1, left, middle) and baselines (Table ).
to correctly guess c, given w. Communication efficiency is measured through surprisal, S, which in this setting corresponds to the average number of guesses an optimal listener takes to arrive at the correct color chip. We calculate S(c) for each