The World of an Octopus: How Reporting Bias Influences a Language Model's Perception of Color

Recent work has raised concerns about the inherent limitations of text-only pretraining. In this paper, we first demonstrate that reporting bias, the tendency of people to not state the obvious, is one of the causes of this limitation, and then investigate to what extent multimodal training can mitigate this issue. To accomplish this, we 1) generate the Color Dataset (CoDa), a dataset of human-perceived color distributions for 521 common objects; 2) use CoDa to analyze and compare the color distribution found in text, the distribution captured by language models, and a human's perception of color; and 3) investigate the performance differences between text-only and multimodal models on CoDa. Our results show that the distribution of colors that a language model recovers correlates more strongly with the inaccurate distribution found in text than with the ground-truth, supporting the claim that reporting bias negatively impacts and inherently limits text-only training. We then demonstrate that multimodal models can leverage their visual training to mitigate these effects, providing a promising avenue for future research.


Introduction
Given sufficient scale, language models (LMs) 1 are able to function as knowledge bases, yielding factoids and relational knowledge across a wide range of topics (Petroni et al., 2019; Bouraoui et al., 2020). However, subsequent work (Bender and Koller, 2020; Bisk et al., 2020; Aroca-Ouellette et al., 2021) has raised concerns about the inherent limitations of text-only pretraining. Motivated by these concerns and limitations, we identify and investigate how reporting bias, a concrete and measurable signal, correlates with these limitations and how multimodal training can mitigate these issues. 1 In this paper, we use LM to refer to both causal LMs and masked LMs.
Everyone knows that most bananas are [MASK]. Grice's conversational maxim of quantity (Grice, 1975) asserts that utterances only contain the required amount of information. This leads to explicit reporting of self-evident knowledge being rare, while less common facts, properties, or events are reported at disproportionately high frequencies. For example, while most people agree that bananas are typically yellow, the bi-gram "green banana" is 332% more frequent in the Google Books Ngram Corpus (Lin et al., 2012) than "yellow banana". 2 This reporting bias inevitably propagates from corpora to the models trained on them (Shwartz and Choi, 2020) and affects a variety of concepts. One such concept that we expect to be harmful in downstream applications, is easy to measure, and is solvable via visual input is color. For these reasons, we investigate the relationship between reporting bias and modern LMs' perception of color.
People's understanding of color is primarily derived from their experience in the world. Every time we interact with an object, we update our understanding of the possible colors that object can take on. Further, we can often apply meaning to the differences: a green banana is unripe, a yellow banana is ideal, and a brown banana may be past its prime. Text-only LMs do not share this embodied experience. Similar to an octopus, 3 they cannot see colors, and need to rely solely on the inaccurate reporting of colors in text. Thus, we expect the colors LMs associate with objects to differ drastically from a human's perception.

Table 1: Object frequencies in each domain/dataset after filtering. We report class label statistics for Open Images and n-gram frequencies for Google Ngrams, Wikipedia, and VQA prompts.
To test this hypothesis, we construct the Color Dataset (CoDa) -a ground-truth dataset of color distributions for 521 well-known objects via crowdsourcing. We use this dataset to compare the color distributions found in text and those predicted from LMs, finding that a LM's shortcomings in recovering color distributions correlate with the reporting bias for those objects. Next, we hypothesize that models having access to multiple modalities, specifically vision and text, may be able to partially overcome these shortcomings by grounding language in their limited visual experiences (Bisk et al., 2020). To this end, we develop a unified framework for evaluating the color perception of text-only and multimodal architectures. Our results support the hypothesis that multimodal training can mitigate the effect of reporting bias. Contributions We make three contributions: 1) We introduce a dataset with human color distributions for 521 well-known objects. 2) We conduct an extensive analysis to identify how reporting bias affects LMs' perception of color. 3) We demonstrate that multimodal training mitigates, but does not eliminate, the impact of reporting bias.

Dataset Creation
Object Selection To ensure all our models -and potential future models -are properly exposed to the objects in our probing dataset, we choose objects which are common in both text and image data. We start with objects from the Open Images dataset (Kuznetsova et al., 2020) and remove all objects which appear less than 25 times in Wikipedia. For example, we remove "dog bed" as the corresponding bi-gram only appears 19 times. This leaves us with an initial set of 687 objects.
We then manually filter out all human-related words, such as "person" as well as hypernyms such as "food", since they are too general to assign specific colors. We also remove transparent objects, such as "windows", and objects that are more than two words long, such as "personal flotation device" and "table tennis racket". This leaves us with our final set of 521 objects. We provide object frequencies from Open Images V6 (Kuznetsova et al., 2020), the Google Books Ngram Corpus (Lin et al., 2012), Wikipedia, and VQA (Goyal et al., 2017) in Table 1.
Color Selection Following Berlin and Kay (1969), we choose the 11 basic color terms of the English language as the colors to be annotated: red, orange, yellow, green, blue, purple, pink, black, white, grey, and brown.
Color Annotation Due to sample bias in image datasets (Torralba and Efros, 2011) and the difficulty of matching pixel values to human perception, generating color distributions by counting color frequencies in images is impractical and challenging to verify. 4 Thus, in line with our focus on human perception of color as it relates to language (i.e., color terms), we approximate color distributions via human annotation crowdsourced on Amazon Mechanical Turk (MTurk). 5 Workers are shown words representing objects and tasked with rating, on a scale from 1 to 5, the frequency with which instances of the objects appear in each of the 11 provided colors. We set up these tasks as human intelligence tasks (HITs), and provide the workers with instructions, which include an example of how one could label "grass" and a concrete list of acceptance and rejection criteria. Each HIT includes 25 objects and is compensated with $1. Fig. 2 shows the user interface as presented to an MTurk worker tasked with annotating the object "apples".
Since we choose objects that appear frequently in datasets, we expect people to be familiar with them. However, for the rare cases where an annotator is unsure about an object's color, our interface includes a skip button. The average crowdworker skips one object. If an object is not skipped, the average worker completes its annotation in 14 seconds. Each object's annotation is normalized to obtain a probability distribution over colors.
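Concretely, each worker's 1-5 ratings over the eleven colors can be normalized to a distribution and then averaged per object. A minimal sketch (we assume ratings are simply divided by their sum; the paper does not specify whether any baseline is subtracted first, and the function names are illustrative):

```python
import numpy as np

COLORS = ["red", "orange", "yellow", "green", "blue", "purple",
          "pink", "black", "white", "grey", "brown"]

def normalize_annotation(ratings):
    """Turn one worker's 1-5 ratings over the 11 colors into a
    probability distribution by dividing by the total mass."""
    r = np.asarray(ratings, dtype=float)
    return r / r.sum()

def ground_truth(annotations):
    """Average the normalized per-worker distributions for one object."""
    dists = np.stack([normalize_annotation(a) for a in annotations])
    return dists.mean(axis=0)
```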
A potential side-effect of crowdsourcing annotations is that annotators might choose fewer colors to minimize the time spent on the task. In light of this, we design a labeling interface that balances the time required for labeling a given object as one, many, or all colors. For example, we include a "Select All" button and use wide click-optimized sliders. With these changes, we find that, on average, users tend to select 6.2 colors per object. For more details and analysis regarding annotator biases, we refer the reader to Appendix A.1.
Quality Control For quality control purposes, each HIT includes "spinach" as a control object at a random position within the group of objects to annotate. This control object serves as a way to flag any submissions which do not follow the instructions or are otherwise not suitable for our purposes. 6 We require the rating of "spinach" to be more than 50% green in order to accept the HIT. Rejected HITs are not included in the dataset. This filters out the small number of workers who provide random or blatantly incorrect annotations. 6 Annotators are made aware that control objects with known color distributions are included in the HIT.
We compute the ground truth as an average over all submitted annotations for a given object. We iteratively filter annotations on a per-object basis if a rating has a Kendall correlation of less than 0 with the current ground truth. This removes 10 annotations that appear to be cases of annotator misinterpretation. For example, one annotator labels "stop sign" as being equally red, yellow, and green, likely confusing "stop sign" with "traffic light".
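The iterative filtering step can be sketched as follows: repeatedly drop the annotation that correlates worst with the current per-object mean until no annotation has a Kendall correlation below 0. This is a sketch of the idea; the paper's exact update order may differ.

```python
import numpy as np
from scipy.stats import kendalltau

def filter_annotations(dists, threshold=0.0):
    """Iteratively drop annotations whose Kendall correlation with the
    current mean distribution is below `threshold` (here: < 0)."""
    dists = [np.asarray(d, dtype=float) for d in dists]
    while True:
        mean = np.mean(dists, axis=0)
        taus = [kendalltau(d, mean)[0] for d in dists]
        worst = int(np.argmin(taus))
        if taus[worst] >= threshold or len(dists) <= 1:
            return np.mean(dists, axis=0), dists
        del dists[worst]
```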

Object Grouping We are investigating the relationship between LMs' knowledge of object colors and reporting bias, the tendency of humans to not state the obvious (Grice, 1975). We hypothesize that reporting bias will be more severe for objects which have a single typical color, as that color will be implicitly assumed by a listener or reader and, accordingly, will be less frequently stated explicitly. In contrast, objects with a distinct set of several possible colors require explicit descriptions to fully capture the visual characteristics of the object. For example, apples are often described as red or green. To test whether objects with different color distributions are impacted by reporting bias differently, we divide the dataset into three categories: single-color objects, multi-color objects, and any-color objects. We categorize objects using k-means clustering with the Jensen-Shannon distance of sorted probabilities. This creates clusters which are color-invariant and based only on the properties of the distributions. We find that this method gives consistent clusters, i.e., the clusters are independent of seeding. We then assign group names semi-manually. 7 "Lemon" is an example of a single-color object, where 73% of the distribution is yellow. "Wine" is a multi-color object with 90% of the distribution falling on red, white, pink, and purple (the last 10% is yellow).
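The color-invariant grouping can be approximated by clustering the descending-sorted probability vectors under Jensen-Shannon distance. Because k-means with a non-Euclidean distance has no closed-form centroid, the sketch below substitutes a small k-medoids loop; treat it as an illustration of the idea, not the paper's exact procedure, and all names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_kmedoids(dists, k=3, iters=20, seed=0):
    """Cluster color distributions by the shape of their sorted
    probabilities, using Jensen-Shannon distance. Sorting makes the
    clustering color-invariant: only the distribution's shape matters."""
    X = np.sort(np.asarray(dists, dtype=float), axis=1)[:, ::-1]
    rng = np.random.default_rng(seed)
    medoids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each object to the nearest medoid under JS distance
        d = np.array([[jensenshannon(x, m) for m in medoids] for x in X])
        labels = d.argmin(axis=1)
        # update each medoid to the member minimizing total distance
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            costs = [sum(jensenshannon(a, b) for b in members)
                     for a in members]
            medoids[j] = members[int(np.argmin(costs))]
    return labels
```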

Templates
Text-only corpora and visually-grounded datasets rarely occupy the same domain. To accommodate both, we form a set of templates for each domain. The first is tailored to text-only models, and consists of both plural templates such as "Most bananas are [MASK]." and singular templates such as "This banana is [MASK]." Our second template group is tailored to visually-grounded datasets. We use most of the templates provided by Radford et al. (2021), which the authors used for finetuning on ImageNet, but exclude templates that inherently point to an unnatural object state, such as "a photo of a dirty banana". Example templates are provided in Table 3.
We recognize that any hand-crafted templates are by nature imperfect. As such, we use all configurations for all models and present the best results per-object for each model to give models ample opportunity to succeed.

Data Splits
Some of our experiments (cf. Section 4.2) require a small training set. Thus, CoDa contains training, development and test splits, with 311, 103, and 106 objects respectively. There is no object overlap between the different sets.
Reporting Bias

Background
As previously stated, Grice's conversational maxim of quantity manifests as reporting bias, i.e., people not usually stating obvious facts or properties, and impacts nearly all datasets that contain text.
Reporting bias has been studied in the context of both NLP and image captioning. Gordon and Van Durme (2013) perform a quantitative analysis using n-gram frequencies from text, finding this phenomenon particularly relevant to internet text corpora. Shwartz and Choi (2020) extend these experiments to pretrained models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Similar to our work, they analyze color attribution of the form "The banana is tasty." However, their ground truth is extracted from Wikipedia bi-grams and, thus, suffers from reporting bias itself. In contrast, we circumvent this problem by collecting the ground truth in CoDa directly from humans.

Reporting Bias in Text
Our hypothesis is that pretrained LMs inherit reporting bias with respect to colors from their training data. Thus, prior to our main experiments, we investigate if, in fact, reporting bias exists in large general text corpora. We analyze the Google Books Ngram Corpus (Lin et al., 2012) and Wikipedia. Specifically, we look at all bi-grams and tri-grams containing a color followed by an object in our dataset.
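Given raw n-gram counts, the quantities defined below reduce to a few dictionary lookups. A minimal sketch (the tuple-keyed count dictionary is an assumed data layout, and the counts in the example are illustrative, not taken from the corpus):

```python
def color_stats(ngram_counts, obj, colors):
    """Given bi-gram counts phi(color, object) and uni-gram counts
    phi(object), compute the relative color frequency freq(o) and the
    color distribution p(c* | o). Assumes at least one color n-gram
    exists for the object."""
    total_colored = sum(ngram_counts.get((c, obj), 0) for c in colors)
    freq = total_colored / ngram_counts[(obj,)]
    p = {c: ngram_counts.get((c, obj), 0) / total_colored for c in colors}
    return freq, p
```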
Let us denote the count of the n-gram x_1 ... x_n as φ(x_1, ..., x_n). We then define the relative frequency with which each object o appears with a color as

$$\mathrm{freq}(o) = \frac{\sum_{c \in C} \varphi(c, o)}{\varphi(o)}.$$

We further define the probability of an object being of color c* as

$$p(c^* \mid o) = \frac{\varphi(c^*, o)}{\sum_{c \in C} \varphi(c, o)}.$$

Table 4: Correlation metrics between the n-gram frequencies reported in different datasets and the ground-truth distributions collected from human annotators. Single, Multi, and Any indicate sets of objects that are frequently a single color, between two to four colors, or could be any color, respectively. We aggregate by object and report the mean ± standard deviation for each metric across the objects of that group.

The results of these experiments are reported in Table 4. The frequency column supports our hypothesis that objects with one typical color are less frequently described as being of any color than those with multiple typical colors or where any color is possible. In all metrics excluding Acc@1, the text-retrieved color distributions are more strongly correlated with the ground truth for multi- and any-colored objects than for single-colored objects. 8

Experimental Setup

Zero-shot Probes
We first probe LMs in a zero-shot fashion using a set of templates (see Section 2.2). Each template has a [MASK] where the color should appear. For models trained using a causal language modeling objective, we run the models over each template eleven times, each time with a different color replacing the [MASK] token. Following Warstadt et al. (2020), we select the sentence with the highest probability. For models trained using a masked language modeling objective, we filter the output vocabulary to only include the eleven color choices and normalize to obtain a probability distribution.
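Both probing modes reduce to simple post-processing of model outputs. A minimal numpy sketch (the [MASK]-position logits and the per-sentence log-probabilities are assumed to have been computed already by the respective model; function names are illustrative):

```python
import numpy as np

def masked_lm_color_dist(mask_logits, color_ids):
    """Restrict a masked LM's logits at the [MASK] position to the
    eleven color tokens and renormalize with a softmax."""
    z = np.asarray(mask_logits, dtype=float)[np.asarray(color_ids)]
    z = z - z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

def causal_lm_color_choice(sentence_logprobs):
    """Given the total log-probability of each of the eleven filled-in
    sentences, pick the color whose sentence scores highest."""
    return int(np.argmax(sentence_logprobs))
```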

Representation Probes
Many current multimodal architectures are optimized for multimodal evaluation and have complex shared embedding spaces, which makes it challenging to compare to text-only models. However, recent developments such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) show promising results in connecting images and text via contrastive pretraining on large unlabeled corpora, while still maintaining separate text and image models. We focus on probing multimodal models which follow these architecture decisions. Since they have not been trained on a language modeling objective, zero-shot probing is not viable on these models. To overcome this and enable comparison to text-only models, we freeze the base model and use part of our dataset to train an MLP to extract color distributions from the frozen representations. Given pretrained representations, we would like the performance of a model to consist of two parts: final quality (in our case distribution correlations), and the amount of effort to get that quality from the representations. This is possible by formulating the task as efficiently learning a model from representations to color distributions. Following Whitney et al. (2021) and Voita and Titov (2020), we conduct our experiments for representation probing in a loss-data framework using minimum description length (MDL), surplus description length (SDL), and ε sample complexity (εSC). We split the training set into 10 subsets spaced logarithmically from 1 to 311 objects, and report averages over 5 seeds.
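The loss-data quantities can be approximated from a discrete loss-data curve over the logarithmically spaced subsets. The sketch below follows the spirit of Whitney et al. (2021) but is only a discrete approximation: the exact definitions in that work differ in detail (e.g., MDL/SDL are defined over codelengths), so treat these functions as assumptions for illustration.

```python
import numpy as np

def subset_sizes(n_max=311, n_points=10):
    """Subset sizes spaced logarithmically from 1 to n_max."""
    return np.unique(np.round(np.geomspace(1, n_max, n_points)).astype(int))

def surplus_description_length(sizes, losses, eps):
    """SDL: excess loss over the target eps, accumulated across the
    loss-data curve (a discrete approximation)."""
    sizes = np.asarray(sizes)
    gaps = np.maximum(np.asarray(losses, dtype=float) - eps, 0.0)
    steps = np.diff(np.concatenate([[0], sizes]))
    return float(np.sum(steps * gaps))

def eps_sample_complexity(sizes, losses, eps):
    """eps-SC: the smallest dataset size whose loss is at or below eps."""
    sizes = np.asarray(sizes)
    ok = np.asarray(losses) <= eps
    return int(sizes[np.argmax(ok)]) if ok.any() else None
```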

Models
We probe object-color probabilities in 14 pretrained text-only models as well as four versions of CLIP (Radford et al., 2021). The text-only models are varied configurations of GPT-2 (Radford et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020); cf. Table 5 for the full set. We use Huggingface's (Wolf et al., 2019) pretrained models for all text-only models and the official implementation of CLIP. 9

Table 6: LM results when probed in a zero-shot setting. Single, Multi, and Any indicate sets of objects that are frequently of a single color, between two to four colors, or could be any color, respectively. All correlation coefficients (ρ, τ) are multiplied by 100. For each object, we take the prediction from the template with the highest τ correlation. We then aggregate by object and report the mean ± standard deviation over objects of that group. We report the results from the best model from each architecture; for results on a per-model basis, see Table 9.

Metrics
In order to obtain as comprehensive a picture as possible, we report a variety of metrics when applicable, including: top-1 accuracy, Spearman rank order correlation ρ, Kendall rank correlation τ, and Jensen-Shannon divergence D_JS for each model and each set of objects. Each of these metrics highlights slightly different aspects of performance on the task. Top-1 accuracy (Acc@1) is the frequency with which models can correctly identify the most frequent color of an object. This is useful for comparing models, but not directly interpretable across object groups, as it inherently favors objects that can take on few colors. Spearman's ρ is sensitive to outliers, so it highlights the extreme mistakes, while Kendall's τ is more robust to such changes. Jensen-Shannon divergence measures the similarity between two distributions. Spearman's ρ and Kendall's τ are within the range of [−1, 1], with −1 being negatively correlated and 1 being perfectly correlated. 10 We additionally define ∆ρ and ∆τ, correlation difference measures defined on the interval [−100, 100], to compare model predictions to n-gram frequency predictions. These measure the difference in correlation between n-gram frequency predictions and a model's probability distribution, where −100 indicates degraded correlation, 0 indicates no change, and 100 indicates improved correlation with the ground truth as compared to the relative n-gram frequencies. In the context of reporting bias, ∆ρ and ∆τ can be interpreted as measures of bias amplification or mitigation for negative and positive values, respectively.

9 github.com/openai/CLIP
10 We multiply by 100 in all tables for legibility.
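The per-object metrics can be computed directly with scipy. One plausible realization (the ∆ measures here are simply the difference of the two correlations times 100, which is an assumption about the exact scaling used):

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau
from scipy.spatial.distance import jensenshannon

def evaluate(pred, truth, ngram_pred):
    """Per-object metrics: Acc@1, Spearman's rho, Kendall's tau,
    JS divergence, average correlation, and the delta measures
    comparing the model to the n-gram baseline."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    rho = spearmanr(pred, truth)[0]
    tau = kendalltau(pred, truth)[0]
    return {
        "acc@1": float(pred.argmax() == truth.argmax()),
        "rho": 100 * rho,
        "tau": 100 * tau,
        # scipy's jensenshannon returns the distance; square it
        # to recover the divergence
        "js_div": jensenshannon(pred, truth) ** 2,
        "avg_corr": 100 * (rho + tau) / 2,
        "delta_rho": 100 * (rho - spearmanr(ngram_pred, truth)[0]),
        "delta_tau": 100 * (tau - kendalltau(ngram_pred, truth)[0]),
    }
```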
We additionally define an average of the two correlation metrics as "Avg. Correlation". When using this metric, we first compute (ρ + τ)/2 for a specific object and perform all other aggregations in the same way as for the other metrics.

Figure 3: Correlation between n-gram frequency and LM performance for single, multi, and any color objects. X and Y axes are Kendall's τ correlation of n-gram frequency with the ground truth and of LM predictions with the ground truth, respectively. Each point corresponds to a single object in our dataset. LM correlation is averaged over the top models for each architecture. The dotted line y=x corresponds to perfect correlation.

Zero-Shot Probes
The results of LMs when probed in a zero-shot setting, provided in Table 6, clearly demonstrate that LMs perform worse on single-color objects and better on objects that can take on a range of colors. Furthermore, correlations are relatively low for all objects and models. This demonstrates that colors are generally challenging for state-of-the-art pretrained LMs. Figure 3 compares the correlation between n-gram frequency and zero-shot LM performance. The identity line represents a theoretical perfect correlation between how well n-gram frequency correlates with our ground truth and LM predictions. 11 Points above the identity line represent cases where LMs seem to mitigate reporting bias (their predictions are closer to the ground truth), and points below the line represent cases where LMs amplify reporting bias (their predictions are further from the ground truth). When averaged across all models (see Appendix C for the full list of results), zero-shot LMs amplify the reporting bias of single-color objects by 5.23% on average, and by 6.26% for multi-color objects. For any-color objects, we find a slight mitigation of 0.21% on average. Table 7 aggregates and combines results from Tables 4 and 6 and elucidates two main points on the effect of reporting bias on a LM's perception of color. First, the color distributions of LMs correlate more strongly with reporting bias-affected text than with a human's perception of color. Second, single-colored objects are the most affected by reporting bias, and the objects LMs struggle the most on. These results indicate that, in line with our hypothesis, LMs are negatively impacted by reporting bias. Further, because reporting bias is innate to human communication and due to the enormous amount of text required for modern LMs, it is infeasible to eliminate reporting bias from all training data. This entails, in support of the arguments in Bender and Koller (2020) and Bisk et al. (2020), that language understanding abilities are naturally limited by text-only training.

Table 7: Average correlation between LM predictions and two sources of "ground truth": one collected from human annotators and one computed from n-gram frequencies.

Representation Probes
Single, Multi, and Any indicate sets of objects that are frequently of a single color, between two to four colors, or could be any color, respectively. The "Freq." column indicates the frequency with which n-grams containing these objects also contain one of the eleven colors.

Figure 4 shows results as a function of the number of training objects. Note that with 14 objects, all models surpass zero-shot performance in terms of Jensen-Shannon divergence.
With enough training objects, we observe similar ranking patterns observed in the zero-shot setting for text-only models. However, the advantage of this approach is that we can additionally include multimodal architectures.
The results from these experiments demonstrate that multimodal models outperform text-only models at recovering color distributions. They manage to do so even though the performance of multimodal models is often lower on classic NLP tasks (Tan and Bansal, 2020) and many multimodal datasets are even more prone to reporting bias in text (Misra et al., 2016; van Miltenburg, 2016; Burns et al., 2018). This further supports the arguments in Bisk et al. (2020) that understanding concepts requires experiencing them in their natural form.

Figure 4: Representation probing results for unseen objects with varying amounts of data, averaged over 5 seeds. The main lines are the best model from those of the same type (e.g., RoBERTa BASE and RoBERTa LARGE), and the translucent lines are the per-model averages. Dotted lines represent the best zero-shot performance for each model. The "Random" group consists of a randomly initialized RoBERTa and CLIP. The black dotted lines correspond to ε and n in Table 8. Left: average of Spearman's ρ and Kendall's τ. Right: Jensen-Shannon divergence.

Limitations
While our work identifies issues with text-only training and motivates the use of multimodal signals during pretraining, in this section we outline some limitations of our approach. First, a number of recent papers have highlighted potential limitations of probing LMs in certain ways (Zhang and Bowman, 2018; Whitney et al., 2021). While we acknowledge that probing does not provide a full picture of the capabilities of LMs, our hypothesis was supported by a range of different results from different approaches. In future work, we hope to leverage research (Bouraoui et al., 2020; Jiang et al., 2020) that demonstrates effective methods for automatically producing templates optimized for specific models. In the current state, we cannot and do not state exactly what LMs do and do not capture; rather, we use our results to uphold and strengthen our original hypothesis that reporting bias hinders performance and that multimodal signals can help mitigate this problem.
Second, the bi-gram/tri-gram approach we use to quantify reporting bias only approximates the full set of object-color instances. To be more exact, a dependency parser would have to be run on every dataset.
Finally, although our results motivate the use of multimodal signals during pretraining, there are still challenges to overcome. As discussed by Tan and Bansal (2020), the performance of multimodal models on classic NLP tasks often does not reflect the inherent advantages of these architectures, and many multimodal datasets are even more prone to reporting bias in text (Misra et al., 2016; van Miltenburg, 2016; Burns et al., 2018). Further, while a visual signal is able to better impart a sense of color, it is not enough to endow models with the meaning behind those colors. Humans easily learn that a green banana is not yet ripe, and that a brown banana is past its prime. For models to obtain this level of knowledge and reasoning they will likely require training signals from more modalities, and potentially fully embodied experiences.

Related Work
Color-Object Relationships Preexisting word association datasets often include object-color relationships as either having multiple equally likely pairings (Gladkova et al., 2016; Kucera and Francis, 1967), or as probabilistic cue-target pairs (Nelson et al., 2004). Others such as Devereux et al. (2013) take a norm completion approach, wherein participants are tasked with generating attributes given some concept. One can then extract the object-color relationships by counting the number of participants who reported a given color.
However, the resulting "distribution" is an aggregate count over individuals, and does not necessarily reflect the distribution from the eyes of a single observer. Thus, previous research into LMs as knowledge bases has not been able to fully explore the extent to which they know color (A. Rodriguez and Merlo, 2020;Shwartz and Choi, 2020).
Previous work has shown the importance of color in visual perception and object recognition (Rosenthal et al., 2018;Gegenfurtner and Rieger, 2000). More recently Teichmann et al. (2020) use time resolved neural imaging data to demonstrate how the typicality of object-color relationships influences object representations in visual processing.
Probing LMs A wide range of papers have probed LMs in a zero-shot fashion by looking at how they fill in a [MASK] token in handcrafted (Weir et al., 2020; Petroni et al., 2019; Jiang et al., 2020; Ettinger, 2020; Lin et al., 2020) or automatically generated (Bouraoui et al., 2020; Jiang et al., 2020) template sentences. Others, such as Warstadt et al. (2020), compare perplexities between minimal pairs of sentences. A different approach is to analyze the representation quality of LMs for linguistic tasks by training a simple MLP on pretrained model representations (Da and Kasai, 2019; Lin et al., 2019). However, Zhang and Bowman (2018) demonstrate that the procedure of training an additional classifier may distort the results. An alternative approach introduced by Voita and Titov (2020) is information-theoretic probing with MDL. This method builds on standard probing classifiers by not only measuring the final performance, but additionally measuring the amount of effort required to achieve that performance.
Probing Multimodal LMs Often multimodal LMs are used in the domain of visual question answering, where, given an image, the model is asked a question about concepts in the image (Goyal et al., 2017;Hudson and Manning, 2019). While it is often possible to simply use the text-only portion of these models for other tasks, this often leads to poor performance on solely language-based tasks (Tan and Bansal, 2020).

Conclusion
In this paper we investigate how reporting bias negatively affects a LM's perception of color. We do so by first creating CoDa, a dataset of 521 human-perceived color distributions for common objects. We then utilize this dataset to demonstrate that text-only models are inherently limited because of reporting bias. Subsequently, we show that multimodal training mitigates these issues. Overall, our results support the claims in Bender and Koller (2020) and Bisk et al. (2020) that text-only training inherently limits a model's language understanding.

A.1 Analysis of Annotator Biases
A potential side-effect of crowdsourcing annotations is that annotators might be biased toward choosing fewer colors in order to finish faster, as this would equate to higher monetary incentives. We observe a small correlation (Kendall's τ = 0.154, p = 0.026) between the total time and the number of colors selected. However, this is to be expected, as selecting more colors takes more time.
All models we evaluate were predominately trained on English text. To accommodate this domain and minimize dataset variance, we recruit only annotators from the United States. This may induce cultural or geographic biases: e.g., the color diversity of carrots is much smaller in the United States than in some Asian countries. Other geographic biases are more fine-grained; for example, the color of fire hydrants in the U.S. depends on where you live and the water source.
Additionally, our choice of colors is not as universal as, for example, the 6 color terms defined by The World Color Survey (Kay et al., 2009). The latter may be more suitable for multilingual studies, though we leave such investigations for future work.

B Experimental Details
For all experiments, we implement the CoDa dataset using the Huggingface Datasets Library. We use Huggingface's (Wolf et al., 2019) pretrained models for evaluating all text-only models, and the official CLIP implementation by Radford et al. (2021) for all CLIP models. 12 We run all experiments on a single machine with one Nvidia Titan RTX GPU.

B.1 Representation Probing
Our representation probing implementation is derived from the efficient JAX version provided by Whitney et al. (2021). 13 We split the training set into 10 subsets spaced logarithmically from 1 to 311 objects, and report averages over 5 seeds. Note that for each seed, any additional points along the curve represent objects added to the previous subset; however, different seeds have different object sets and thus a different number of samples per subset. For our dataset, we found the difference in samples to be far less impactful on performance than the number of objects.
All probes are 2-layer MLPs with ReLU activation functions and are trained using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 10^-4. All probes are trained for 4000 steps. More details on how to reproduce the experiments are provided in our GitHub repository. 14

12 github.com/openai/CLIP
13 github.com/willwhitney/reprieve
14 github.com/nala-cub/coda
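For illustration, the probe setup can be sketched in plain numpy (the actual experiments use the JAX implementation cited above). The hyperparameters default to those in the text; everything else, including the initialization scale and loss, is an assumption for the sketch.

```python
import numpy as np

def probe_predict(params, X):
    """Forward pass of the 2-layer ReLU MLP probe with a softmax head."""
    W1, b1, W2, b2 = params
    h = np.maximum(X @ W1 + b1, 0)
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(params, X, Y):
    """Mean cross-entropy between target and predicted distributions."""
    return float(-(Y * np.log(probe_predict(params, X) + 1e-12))
                 .sum(axis=1).mean())

def train_probe(X, Y, hidden=64, lr=1e-4, steps=4000, seed=0):
    """Train the probe with full-batch Adam on frozen representations X
    and target color distributions Y."""
    rng = np.random.default_rng(seed)
    d_in, d_out = X.shape[1], Y.shape[1]
    params = [rng.normal(0, 0.1, (d_in, hidden)), np.zeros(hidden),
              rng.normal(0, 0.1, (hidden, d_out)), np.zeros(d_out)]
    m = [np.zeros_like(p) for p in params]
    v = [np.zeros_like(p) for p in params]
    for t in range(1, steps + 1):
        W1, b1, W2, b2 = params
        h = np.maximum(X @ W1 + b1, 0)
        g_logits = (probe_predict(params, X) - Y) / len(X)
        dh = (g_logits @ W2.T) * (h > 0)          # backprop through ReLU
        grads = [X.T @ dh, dh.sum(0), h.T @ g_logits, g_logits.sum(0)]
        for i in range(4):                        # Adam update
            m[i] = 0.9 * m[i] + 0.1 * grads[i]
            v[i] = 0.999 * v[i] + 0.001 * grads[i] ** 2
            mhat = m[i] / (1 - 0.9 ** t)
            vhat = v[i] / (1 - 0.999 ** t)
            params[i] = params[i] - lr * mhat / (np.sqrt(vhat) + 1e-8)
    return params
```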

C Zero Shot Results
The zero-shot results for all evaluated LMs are provided in Table 9.  Table 9: LM results when probed in a zero-shot setting. Single, Multi, and Any indicate sets of objects that are frequently a single color, between two to four colors, or could be any color, respectively. All correlation coefficients (ρ, τ) are multiplied by 100. Means and standard deviations are calculated over objects of the respective group.