Abstract Visual Reasoning with Tangram Shapes

We introduce KiloGram, a resource for studying abstract visual reasoning in humans and machines. Drawing on the history of tangram puzzles as stimuli in cognitive science, we build a richly annotated dataset that, with >1k distinct stimuli, is orders of magnitude larger and more diverse than prior resources. It is both visually and linguistically richer, moving beyond whole shape descriptions to include segmentation maps and part labels. We use this resource to evaluate the abstract visual reasoning capacities of recent multi-modal models. We observe that pre-trained weights demonstrate limited abstract reasoning, which dramatically improves with fine-tuning. We also observe that explicitly describing parts aids abstract reasoning for both humans and models, especially when jointly encoding the linguistic and visual inputs.


Introduction
Reference is a core function of natural language that relies on shared conventions and visual concepts. For example, in English, a speaker may use the term dog to refer to a particular animal of the species Canis familiaris, or, through abstraction, to an object with a less strongly conventionalized name, such as the shape at the top of Figure 1. A speaker might refer to such a shape as looking like a dog, and even point to its parts, like its head and tail, despite its having few visual features in common with the ordinary referent.
Comprehension and generation of references are critical for systems to engage in natural language interaction, and have been studied extensively with a focus on ordinary references (e.g., Viethen and Dale, 2008; Mitchell et al., 2010; FitzGerald et al., 2013; Mao et al., 2016; Yu et al., 2016), in contrast to the visual abstraction illustrated in Figure 1. We address this gap by adopting an influential paradigm for probing human coordination in the cognitive science literature: reference games with abstract tangram shapes (e.g., Clark and Wilkes-Gibbs, 1986; Fox Tree, 1999; Hawkins et al., 2020).
* Equal contribution, alphabetically ordered.
Unlike photographs of natural objects, where there is often a single canonical label, tangrams are fundamentally ambiguous. While some shapes fall under strong existing conventions and elicit consensus about appropriate names (e.g., Figure 1, top), others are characterized by weaker conventions (e.g., Figure 1, bottom) and every speaker may arrive at a distinct but valid description (Zettersten and Lupyan, 2020; Hupet et al., 1991). While such diversity is a key consideration motivating their use as stimuli, existing behavioral studies have typically been limited to a relatively small set of 10-20 shapes, highly restricting the overall diversity of the stimulus class. This small scale also limits their applicability for training and analyzing vision and language models, where significantly more data is necessary.
In this paper, we significantly expand this resource. We introduce KILOGRAM, a large collection of tangrams with rich language annotations. KILOGRAM dramatically improves on existing resources along two dimensions. First, we curate and digitize 1,016 shapes, creating a set that is two orders of magnitude larger than collections used in existing work. This set dramatically increases coverage over the full range of naming variability, providing a more comprehensive view of human naming behavior. Second, rather than treating each tangram as a single whole shape, our images are vector graphics constructed from the original component puzzle pieces. This decomposition enables reasoning about both whole shapes and their parts.
We use this new collection of digitized tangram shapes to collect a large dataset of textual descriptions, reflecting a high diversity of naming behaviors. While existing work has focused on naming the complete shape, we also ask participants to segment and name semantically meaningful parts. We use crowdsourcing to scale our annotation process, collecting multiple annotations for each shape, thereby representing the distribution of annotations it elicits, rather than a single sample. In total, we collect 13,404 annotations, each describing a complete object and its segmented parts.
The potential of KILOGRAM is broad. For example, it enables the data-driven scaling of studies of human interactions and models of whole-part reasoning in language and vision models. In this paper, we use KILOGRAM to evaluate the visual reasoning capacities of recent pre-trained multi-modal models, focusing on generalizing concepts to abstract shapes. We observe limited generalization of this type in pre-trained models, but significant improvements following fine-tuning with our data. We also see how explicitly referring to and visualizing parts can help reference resolution. Data and code, as well as a data viewer, are available at: https://lil.nlp.cornell.edu/kilogram/.

Background and Related Work
Abstract or ambiguous visual stimuli have been widely used to investigate how human partners coordinate when talking about things in the absence of strong naming conventions, going back to Krauss and Weinheimer (1964). Tangrams as stimuli were introduced by Clark and Wilkes-Gibbs (1986). These shapes are all built from the same seven primitives, but elicit a wide range of figurative descriptions that conceptualize shapes in different ways (Schober and Clark, 1989; Horton and Gerrig, 2002; Duff et al., 2006; Holler and Wilkin, 2011; Horton and Slaten, 2012; Ibarra and Tanenhaus, 2016; Shore et al., 2018; Atkinson et al., 2019; Castillo et al., 2019; Bangerter et al., 2020). It has been observed that some shapes are easier or harder to describe (Hupet et al., 1991; Zettersten and Lupyan, 2020; Brashears and Minda, 2020), a property known as nameability or codability, which has also been studied with non-tangram shapes (e.g., line drawings; Snodgrass and Vanderwart, 1980; Cycowicz et al., 1997; Duñabeitia et al., 2018). Even though diversity is a key consideration in working with tangrams, existing stimulus sets are relatively small, limiting their usefulness as NLP benchmarks, where scale is critical. Even the largest studies of variability in naming (e.g., Murfitt and McAllister, 2001) have used a relatively small set of 60 tangrams. Fasquel et al. (2022) present a resource that is related and complementary to ours, including 332 PNG-formatted tangrams with whole-shape naming annotations in French.
Contemporary pre-trained vision and language approaches can be categorized along an axis characterizing how they encode the data, from jointly encoding the two inputs (Lu et al., 2019; Chen et al., 2020; Kim et al., 2021) to encoding them separately (Radford et al., 2021; Jia et al., 2021). Joint encoding aims to capture tighter interaction between the input modalities compared to separate encoding, but is generally more computationally expensive, and can only operate on multi-modal input. We study recent models on both ends: ViLT (Kim et al., 2021) for joint encoding and CLIP (Radford et al., 2021) for separate encoding.
These models are typically evaluated on image captioning (e.g., Chen et al., 2015) or visual question answering (e.g., Antol et al., 2015) benchmarks. Several benchmarks, such as NLVR (Suhr et al., 2017, 2019) and Winoground (Thrush et al., 2022), aim for more targeted evaluations that focus on compositionality. We build on these efforts, but target generalization through abstraction using visually ambiguous stimuli. This is inspired by the role of abstraction in human cognition. Abstraction is a key step in human perception (Biederman, 1987) that is critical for generalization (Gentner and Markman, 1997; Medin et al., 1993; Shepard, 1987), and forms the shared foundation on which human language communication is layered (Lupyan and Winter, 2018; McCarthy et al., 2021; Wong et al., 2022). Our focus on part decomposition is aligned with how part identification plays an important role in human abstraction (Tversky and Hemenway, 1984).

Data Collection
We scan a large set of tangram puzzles to vector graphics, and crowdsource annotations of natural language descriptions and part segmentations.

Collecting Tangram Puzzles
Tangram puzzles are made of seven primitive shapes (Elffers, 1977), which can be combined in a large variety of configurations evoking different concepts. We scan 1,004 tangrams depicting a broad set of concepts to vector graphic SVGs from Slocum (2003). Appendix A.1 shows example tangrams, and Appendix A.2 details our process. We also manually add 12 tangrams commonly used in previous studies (Hawkins et al., 2020).

Whole-Part Annotation
We design a two-stage crowdsourcing task to elicit natural language English descriptions for each tangram, both of the whole shape and of its parts (Figure 2). First, in the whole-shape description stage, the worker is shown a tangram image in grayscale and asked to complete the prompt "This shape, as a whole, looks like ____." In the part annotation stage, the worker is asked to select one or more puzzle pieces, and complete the prompt "The part(s) you selected look(s) like ____." These pieces are then colored and the annotation appears in the corresponding color. The annotator can delete annotations, annotate a part as UNKNOWN when they are not sure about its semantics, and add pieces to existing parts. All pieces must be annotated to submit the task, yielding a complete segmentation map.
We use Amazon Mechanical Turk for data collection. Workers are required to be located in the United States with at least a 98% HIT acceptance rate, must pass a qualification task, and complete a survey about their language proficiency (see Appendix A.3 for further details). To prevent a small group of workers from dominating the data, each annotator is only allowed to annotate each tangram once, and cannot annotate more than 200 distinct tangrams. Workers are paid 0.14 USD per task. We first collect 10,053 annotations for the 1,004 scanned tangrams, with at least 10 annotations for each tangram (mean=10.01). Following this stage of annotation, we collect additional annotations for a subset of the tangrams to create a set with denser language and part segmentation annotation. We sample 62 tangrams to be representative of the different levels of diversity in annotations we observe in the initially collected data. Appendix A.4 describes the sampling procedure. We also add the 12 tangrams from previous studies for a total of 74 tangrams for dense annotation. We conduct additional annotation tasks to have at least 50 annotations for each of the 74 tangrams selected for dense annotation (mean=53.66). The dense annotation gives us a better estimate of the distribution of language for the 74 selected tangrams, for example to use as reference texts in generation tasks.
In total, we collect 13,404 annotations for 1,016 tangrams at a total cost of 2,172.94 USD. We lowercase and stem to compute vocabulary size, and tokenize on white spaces to compute description length. A total of 297 MTurk workers participate in the annotation, with 98.0% of the workers speaking English as their first language. Those who do not speak English as their first language still rate their English proficiency level as native or close to native. 1.0% of the workers speak more than one language, among which the most common are Spanish, German, Japanese, and Chinese.

Standard Data Splits
We split the dataset for analysis and learning experiments. For analysis, we create two overlapping sets: FULL and DENSE. FULL includes 1,016 tangrams, each with 10-11 annotations (mean=10.01). It includes the 10,053 annotations initially collected for the scanned 1,004 tangrams. For the 12 commonly used tangrams, we sample 10 annotations from the later collection effort. DENSE includes all annotations for the 74 densely annotated tangrams, with at least 50 annotations per tangram (mean=53.66). We also define the set DENSE10 to include only the annotations from the sparse set for the densely annotated tangrams. For learning experiments, we split according to tangrams to create training (692 tangrams), development (125), test (125), and test-dense (74) sets. All densely annotated tangrams are in test-dense. The other three sets are split randomly.

Data Analysis
The language and concepts annotators use reflect varying degrees of consensus around conventions for describing the appearance of shapes and their parts. For analysis, we preprocess the annotations by lowercasing, tokenizing, lemmatizing, and removing stop words using NLTK (Bird, 2004). We use the larger FULL set for our analyses (Section 3.3), unless otherwise noted.
For a broad overview of the types of concepts evoked, we manually tag 250 randomly sampled annotations: 30.8% use human-like concepts (e.g., dancer), 31.2% animate but non-human concepts (e.g., dog), and 38.0% non-animate concepts (e.g., house). We examine how part words differ across whole-shape concepts by extracting head words from whole-shape and part descriptions. Figure 3 shows the distribution of part head words for each of 272 whole-shape head words with >10 occurrences, ranked in order of frequency. Figure A.2 in the appendix illustrates how the most common part head word is used in different tangrams.
A central problem of visual abstraction is the degree of ambiguity or subjectivity that a shape evokes across different people (Murthy et al., 2022): some descriptions have higher consensus than others. We define three measures of variability along different dimensions: shape naming divergence (SND), part naming divergence (PND), and part segmentation agreement (PSA). Table 2 lists the mean and standard deviation for these three measures over the sparsely and densely annotated data.
Shape Naming Divergence (SND) A tangram's SND quantifies the variability among whole-shape annotations. SND is an operationalization of nameability, a criterion that is commonly used to measure how consistent the naming of an object is across individuals (e.g., Zettersten and Lupyan, 2020).
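Formally, a whole-shape annotation is a sequence of $M$ tokens $x = \langle x_1, \ldots, x_M \rangle$. Given a tangram with $N$ annotations $x^{(j)}$, $j = 1, \ldots, N$, each of length $M^{(j)}$, we define $w_i^{(j)}$ for each token $x_i^{(j)}$ in annotation $x^{(j)}$ as the proportion of other annotations of that tangram that do not contain $x_i^{(j)}$:
\[ w_i^{(j)} = \frac{1}{N-1} \sum_{j' \neq j} \mathbb{1}\left[ x_i^{(j)} \notin x^{(j')} \right], \]
where $\mathbb{1}$ is an indicator function. The divergence of an annotation $x^{(j)}$ is the mean of its token weights, $W^{(j)} = \frac{1}{M^{(j)}} \sum_{i=1}^{M^{(j)}} w_i^{(j)}$, and the SND of a tangram is the mean divergence over its annotations, $\frac{1}{N} \sum_{j=1}^{N} W^{(j)}$. For example, the SNDs of the tangrams in Figure 1 computed only with the two annotations displayed are 0.00 (top) and 1.00 (bottom).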
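The SND definition can be illustrated with a minimal Python sketch. It assumes that the divergence of one annotation is the mean of its token weights and that a tangram's SND is the mean over its annotations, which matches the 0.00 and 1.00 values reported for Figure 1; the function names and the toy token lists are hypothetical and are not taken from the released code.

```python
def token_weight(token, other_annotations):
    """Share of the other annotations that do not contain `token`."""
    missing = sum(token not in set(other) for other in other_annotations)
    return missing / len(other_annotations)


def annotation_divergence(tokens, other_annotations):
    """Mean token weight of one annotation (assumed aggregation)."""
    return sum(token_weight(t, other_annotations) for t in tokens) / len(tokens)


def shape_naming_divergence(annotations):
    """Mean annotation divergence over all annotations of one tangram."""
    if len(annotations) < 2:
        raise ValueError("need at least two annotations")
    divergences = [
        annotation_divergence(tokens, annotations[:j] + annotations[j + 1:])
        for j, tokens in enumerate(annotations)
    ]
    return sum(divergences) / len(divergences)


# Toy annotations (hypothetical, already preprocessed into token lists).
print(shape_naming_divergence([["dog"], ["dog"]]))          # 0.0
print(shape_naming_divergence([["bird"], ["house"]]))       # 1.0
print(shape_naming_divergence([["dog", "head"], ["dog"]]))  # 0.25
```

The sketch expects every annotation to keep at least one token after preprocessing; empty annotations would need to be filtered out first.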
Mean SND is relatively high in our data, with 0.91 on FULL (Table 2). We observe relatively similar values for DENSE and DENSE10, albeit with lower standard deviation for DENSE, as expected with more annotations. Annotators often use words that are unique to their annotation. We observe perfect consensus for only one tangram, and mostly similar annotations with relatively few deviations for a few others.
Part Naming Divergence (PND) A tangram's PND quantifies the variability among the part names collected in the second step of the annotation task. PND is computed identically to SND, but with the concatenation of all part names of an annotation as the input text x. For example, the PNDs of the two tangrams in Figure 1 computed with only the two annotations displayed are 0.19 (top) and 1.00 (bottom). In general, part descriptions are more similar than whole-shape descriptions, with a mean PND of 0.76 (Table 2).
Part Segmentation Agreement (PSA) Annotators segment the tangrams into parts by grouping the tangram puzzle pieces. PSA quantifies the agreement between part segmentations as the maximum number of pieces that do not need to be moved to another group in order to edit one segmentation into another. We compute PSA as a linear sum assignment problem with maximum weight matching. For each pair of segmentations, we create a cost matrix, where the number of rows is the number of parts in one annotation and the number of columns is the number of parts in the second annotation. The value of each matrix element is the number of matching puzzle pieces between the two corresponding parts in the two annotations, so the assignment maximizes the total rather than minimizing it. The tangram PSA is the mean of the resulting matching totals over all annotation pairs. For example, the PSAs of the two tangrams in Figure 1 computed with only the two annotations displayed are 6.00 (top) and 3.00 (bottom).
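The matching step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the released implementation: a segmentation is represented as a list of sets of piece IDs (0 to 6), and because a tangram has at most seven parts, the maximum-weight assignment is found by brute force over permutations instead of a dedicated solver. For larger problems, `scipy.optimize.linear_sum_assignment` with `maximize=True` solves the same assignment problem.

```python
from itertools import combinations, permutations


def segmentation_agreement(seg_a, seg_b):
    """Maximum number of pieces shared under a one-to-one matching of parts."""
    small, large = sorted((seg_a, seg_b), key=len)
    # weights[r][c]: pieces shared by part r of `small` and part c of `large`.
    weights = [[len(p & q) for q in large] for p in small]
    best = 0
    for cols in permutations(range(len(large)), len(small)):
        total = sum(weights[r][c] for r, c in enumerate(cols))
        best = max(best, total)
    return best


def part_segmentation_agreement(segmentations):
    """Mean pairwise agreement over all annotations of one tangram."""
    pairs = list(combinations(segmentations, 2))
    return sum(segmentation_agreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical segmentations of the seven pieces.
seg_a = [{0, 1}, {2, 3, 4}, {5, 6}]
seg_b = [{0, 1, 2}, {3, 4}, {5, 6}]
seg_c = [{0, 1, 2, 3, 4, 5, 6}]
print(segmentation_agreement(seg_a, seg_b))  # 6
print(segmentation_agreement(seg_a, seg_c))  # 3
```

In the first toy pair only piece 2 has to move, so six of the seven pieces stay in place.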
The mean PSA in our data is 5.30 (Table 2), with an approximately normal distribution of values. Some tangrams have strong segmentation cues, such that annotators reach perfect consensus, while others elicit significant segmentation disagreement.

Dense Annotations
The comparison of FULL, DENSE, and DENSE10 illustrates how well our data approximates the real distribution of annotations for each tangram, and the advantage of DENSE. Figure 4 shows the complete distribution of values. Comparing DENSE10 and DENSE, the rankings of the tangrams are largely the same with the additional annotations: for SND, Spearman's rank correlation coefficient is r(72) = .78, p ≪ .001; for PND, r(72) = .87, p ≪ .001; for PSA, r(72) = .76, p ≪ .001. The tangrams sampled for DENSE represent well the distribution of tangrams along the different measures, as illustrated by the red highlights in Figure 4.
Inter-measure Correlations Figure 5 illustrates the correlations between the three measures. The divergences of the two types of language annotations, whole-shape and part descriptions, show moderate positive correlation, r(1014) = .531, p ≪ .001. This indicates that tangrams that are annotated with similar whole-shape descriptions are often annotated with similar part descriptions.

Visual Reasoning with Tangrams
We use KILOGRAM to evaluate the reasoning of CLIP (Radford et al., 2021) and ViLT (Kim et al., 2021) through a reference game task, where the model is given a textual description and selects the corresponding image from a set of images. Formally, given a textual description $x$ and a set of $k$ images $\mathcal{I} = \{I_1, \ldots, I_k\}$, the task is to select the image $I_i \in \mathcal{I}$ corresponding to $x$. We cast the task as computing a similarity score $f(x, I_i)$ between the description $x$ and an image $I_i$. We select the corresponding image as $I^* = \arg\max_{I_i \in \mathcal{I}} f(x, I_i)$.
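The selection rule is a plain argmax over scores, as the following sketch shows. The scoring function is passed in as a parameter; the dot product over toy two-dimensional vectors used here is only a stand-in for a real model, and all names are hypothetical.

```python
def select_image(description, images, score_fn):
    """Return the index of the highest-scoring image and all scores."""
    scores = [score_fn(description, image) for image in images]
    best_index = max(range(len(images)), key=lambda i: scores[i])
    return best_index, scores


def dot_product_score(text_vec, image_vec):
    """Toy similarity between two already-encoded vectors."""
    return sum(t * v for t, v in zip(text_vec, image_vec))


# Hypothetical embeddings of one description and three candidate images.
text = [1.0, 0.0]
images = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]
best, scores = select_image(text, images, dot_product_score)
print(best)  # 1
```

Accuracy on a set of reference games is then the fraction of games in which the selected index is the target's index.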

Reference Game Generation
We randomly generate reference games for an annotated text-image pair (x, I) by sampling k − 1 additional images from the data under several constraints. We do not include repeating images in the set of k images or images that have identical whole-shape text annotations. This avoids obvious ambiguity that is impossible to resolve in the target selection.
We also require all images to be annotated with the same number of parts. This reduces the chance of the model relying on simple part counting to discriminate between target images when including parts in the text (condition PARTS below). Appendix A.8 shows the impact of these constraints by analyzing experiments that do not use them.
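One way to implement the sampling constraints is rejection while iterating over a shuffled candidate pool, sketched below. The record layout (`tangram_id`, `whole`, `num_parts`) is a hypothetical simplification in which every candidate image comes paired with one annotation, and treating "identical whole-shape text annotations" as exact string equality is an assumption.

```python
import random


def sample_reference_game(target, pool, k=10, rng=None):
    """Sample k - 1 distractors for `target` under the three constraints."""
    rng = rng if rng is not None else random.Random()
    # Constraint 3: same number of annotated parts as the target.
    candidates = [r for r in pool if r["num_parts"] == target["num_parts"]]
    rng.shuffle(candidates)
    game = [target]
    seen_ids = {target["tangram_id"]}
    seen_texts = {target["whole"]}
    for record in candidates:
        # Constraints 1 and 2: no repeated image, no repeated whole-shape text.
        if record["tangram_id"] in seen_ids or record["whole"] in seen_texts:
            continue
        game.append(record)
        seen_ids.add(record["tangram_id"])
        seen_texts.add(record["whole"])
        if len(game) == k:
            break
    if len(game) < k:
        raise ValueError("not enough valid distractors")
    rng.shuffle(game)  # hide the target's position
    return game


# Hypothetical pool: 30 three-part records plus 10 four-part records.
pool = [{"tangram_id": i, "whole": f"shape {i}", "num_parts": 3} for i in range(30)]
pool += [{"tangram_id": 100 + i, "whole": f"other {i}", "num_parts": 4} for i in range(10)]
target = pool[0]
game = sample_reference_game(target, pool, k=10, rng=random.Random(0))
print(len(game))  # 10
```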

Models
We instantiate f using CLIP or ViLT, two models based on the Transformer architecture (Vaswani et al., 2017). We provide a brief review of the models, and refer the reader to the respective papers for further details. CLIP uses two separate encoders to generate separate fixed-dimension representations of the text and images. It uses contrastive pre-training with a symmetric cross entropy loss on a large amount of aligned, but noisy web image-text data. We implement the scoring function f with CLIP by encoding the text x and all images $I \in \mathcal{I}$ separately, and then computing the dot-product similarity score of the text with each image. This is identical to the CLIP pre-training objective, which potentially makes CLIP suitable for our task out of the box.
ViLT uses a single encoder that takes as input both the text and image inputs together. ViLT pre-training also uses aligned image-text data, but from existing benchmarks (Lin et al., 2014; Krishna et al., 2016; Ordonez et al., 2011; Sharma et al., 2018). It is pre-trained using multiple self-supervised objectives, including image-text matching via a binary classification head, which is suitable for our task out of the box. We implement f using this classification head. Given a text x and an image $I \in \mathcal{I}$, we compute their similarity using the matching classification head.

Experimental Conditions
We study several input variants. Figure 6 illustrates the modalities under the different conditions, and Appendix A.5 shows complete example inputs. For the textual description x, we experiment with including the whole-shape description only (WHOLE) or adding part names (PARTS) by combining with the whole-shape description using the template <whole shape> with <part>, <part>, ..., and <part>. This tests the ability of models to benefit from part names. We consider two image I conditions: coloring all parts with the same color (BLACK) or coloring parts differently (COLOR). The color choice in COLOR corresponds to the position of the part name in x, when the text includes part names (PARTS).
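The text template can be sketched as a small helper. How the template treats one or two part names is not specified here, so the handling of those cases is an assumption, as are the function name and the example strings.

```python
def build_description(whole, parts=None):
    """Fill '<whole shape> with <part>, <part>, ..., and <part>'."""
    if not parts:
        return whole  # WHOLE condition: whole-shape description only
    if len(parts) == 1:
        part_text = parts[0]  # assumed handling of a single part
    elif len(parts) == 2:
        part_text = f"{parts[0]} and {parts[1]}"  # assumed handling of two parts
    else:
        part_text = ", ".join(parts[:-1]) + ", and " + parts[-1]
    return f"{whole} with {part_text}"


print(build_description("dog"))
# dog
print(build_description("dog", ["head", "body", "tail"]))
# dog with head, body, and tail
```

In the COLOR condition, the position of each name in `parts` would also index the color used for the corresponding pieces in the image.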
We experiment with the original pre-trained model weights, and with contrastive fine-tuning on our data using a symmetric cross entropy loss (Radford et al., 2021). During fine-tuning only, we consider a data augmentation condition (AUG), where we augment the data by creating examples that include only a subset of the part names in the text and coloring only the parts corresponding to the included part names in the image, while all other parts remain black. We generate partial part examples for all possible subsets of parts for each example. Appendix A.5 illustrates the generated examples. When generating reference games for the augmented data, we constrain all the examples within a reference game to have the same number of parts in their full annotation, otherwise the task could be solved by counting parts. Part names are shuffled when creating the augmented data, and part colors correspond to the sequential position of the part name in the templated text.
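The augmentation can be sketched as an enumeration of part subsets. The sketch assumes that only non-empty subsets are generated, represents colors as position indices rather than actual color values, and joins part names with a simplified comma-separated template; the `(part_name, piece_ids)` layout and all names are hypothetical.

```python
import random
from itertools import combinations


def augment_example(whole, parts, rng):
    """parts: list of (part_name, piece_ids) pairs from one annotation."""
    examples = []
    for size in range(1, len(parts) + 1):  # assumption: skip the empty subset
        for subset in combinations(parts, size):
            ordered = list(subset)
            rng.shuffle(ordered)  # part names are shuffled
            names = [name for name, _ in ordered]
            text = f"{whole} with {', '.join(names)}"  # simplified template
            # Color index = position of the part name in the text;
            # pieces missing from this mapping stay black.
            color_of_piece = {
                piece: position
                for position, (_, pieces) in enumerate(ordered)
                for piece in pieces
            }
            examples.append((text, color_of_piece))
    return examples


parts = [("head", [0]), ("body", [1, 2, 3]), ("tail", [4, 5, 6])]
examples = augment_example("dog", parts, random.Random(0))
print(len(examples))  # 7
```

An annotation with n parts yields 2^n − 1 examples under this assumption, so the augmented set grows quickly with the number of parts.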

Implementation Details
We set the size of the reference game context to k = 10 throughout our experiments. During contrastive fine-tuning, we create a text-image matching matrix of size k × k for each generated reference game in our training data by randomly selecting a text description for each tangram distractor from its annotations. We compute matching loss in both directions, from text to images and vice versa. In practice, this is equivalent to creating 2k reference games in both directions, and provides more informative updates. For all experiments, we use an ensemble of three models combined by element-wise multiplication of their outputs. Appendix A.7 provides model-specific implementation details. Appendix A.9 provides a reproducibility list.
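The two-directional matching loss over a k × k score matrix can be sketched in plain Python. The sketch assumes that matched text-image pairs lie on the diagonal and that the two directions are averaged, as in CLIP-style training; it omits any temperature scaling and is not the training code used in the experiments.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def symmetric_cross_entropy(scores):
    """scores[i][j]: similarity of text i and image j; matches on the diagonal."""
    k = len(scores)
    # Text-to-image: each row is one reference game over the k images.
    text_to_image = -sum(log_softmax(scores[i])[i] for i in range(k)) / k
    # Image-to-text: each column is one reference game over the k texts.
    columns = [list(col) for col in zip(*scores)]
    image_to_text = -sum(log_softmax(columns[j])[j] for j in range(k)) / k
    return (text_to_image + image_to_text) / 2


uniform = [[0.0] * 3 for _ in range(3)]
confident = [[10.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(symmetric_cross_entropy(uniform))    # equals log(3), chance level for k = 3
print(symmetric_cross_entropy(confident))  # close to zero
```

Each row and each column acts as one reference game, which is why a single k × k matrix provides 2k training signals.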

Estimating Human Performance
We conduct an initial estimation of expected human performance on the same evaluation task by recruiting an independent group of 217 human participants. Each participant is randomly assigned to one of the four conditions and shown a random sequence of 20 trials from that condition, preventing leakage across conditions. On each trial, we present an annotation from our development set along with the corresponding context of ten tangrams and ask the participant to click the tangram that was being described. We randomly sample one referential context per annotation, which provides coverage over all 125 tangrams and over 600 unique descriptions in each condition. Before the actual test trials, each participant is provided with a fixed set of 10 practice trials with feedback indicating whether they have selected the correct tangram, and if not, we highlight the correct answer. Performance in the practice trials is not considered in our analysis. Appendix A.6 provides further details.

Results and Analysis
While both models perform better than a random baseline (10%) out of the box, we generally observe poor performance with the pre-trained weights (PT). CLIP slightly outperforms ViLT throughout, potentially because it is trained with a contrastive objective similar to a reference game. Whereas ViLT's matching loss is aligned with our goal, it is only one of several losses in its objective. We observe no reliable improvement from adding part information, either textual or visual. The low performance on WHOLE+BLACK indicates the models fail to generalize familiar concepts to abstract shapes, and the lack of consistent improvement with part information indicates an inability to reason about the correspondence of text and colored parts.
Fine-tuning (FT) dramatically improves performance for both models. Adding part names to the text description improves both models (PARTS+BLACK). However, segmentation information in the form of part coloring without part names (WHOLE+COLOR) shows no benefit. Although ViLT does not benefit from color information alone, the combination with part names (PARTS+COLOR) shows significant added improvement in performance over having access to part information in one of the modalities. Overall, we observe small consistent differences in performance between the two models, except when having access to both part names and colors (PARTS+COLOR), which ViLT effectively uses following fine-tuning. This may be because ViLT's tight integration of the modalities in its single encoder allows it to take advantage of the part correspondence information provided when both part names and colors are given.
Human performance follows a similar trend to the fine-tuned models: adding part names and segmentation helps performance, and their benefit is most pronounced when both are provided. Human performance is significantly higher than pre-trained (PT) models across all four conditions. Fine-tuning (FT) closes this gap. Indeed, in the PARTS+COLOR condition, ViLT significantly outperforms mean human performance. To better analyze human results, we fit a two-component Gaussian mixture model to the distribution of individual participants' accuracies (Figure 7). We observe two components for all conditions except WHOLE+BLACK, indicating two distinct sub-populations. For example, for PARTS+COLOR, the low-performing sub-population has a mean accuracy of 52.5%, while the high-performing sub-population has a mean of 83.8%, significantly outperforming the fine-tuned ViLT. It is possible that the lower-performance sub-population is not making full use of the additional information.
Data augmentation (AUG) improves performance for CLIP, but not for ViLT, which even shows a small decrease in performance, although still significantly outperforming CLIP. We hypothesize that the presence of training examples with partial part information complicates resolving the correspondence between parts and their names, resulting in overall lower ViLT performance. We leave further study of this hypothesis for future work.
The augmentation condition fine-tunes the models to handle examples with partial part information, and allows us to study the impact of gradually adding part information. We apply the augmentation process to the development data to generate the data for this analysis. Figure 8 shows the effect of gradually adding part information on the probability of the correct prediction, separated by the total number of parts in the example. Overall, part information is beneficial, but with a diminishing return as more part information is added. We observe this for both models, but the return diminishes much faster for CLIP, which overall shows much lower performance. ViLT is able to benefit from increasing part information, with the benefit diminishing only after four parts are provided.

Discussion
KILOGRAM provides a new window into the visual abstraction capacity of grounded language models and their ability to generalize concepts beyond their photographic appearance, an integral component of human concept representations (Fan et al., 2015). Our experiments show that there is significant room to improve pre-trained models, which should be able to perform zero-shot reference game tasks without fine-tuning as well as humans do (Clark and Wilkes-Gibbs, 1986). The improved performance after fine-tuning indicates the multi-modal architecture itself has the potential for higher performance, which current pre-training regimes likely do not support. In particular, ViLT's improved performance as a function of additional part information suggests that more structured concept alignment may play a role in this effort (e.g., between parts expressed as lexical items and the corresponding elements of the image).
While we focused on the task of reference resolution, KILOGRAM is also well-suited for production tasks (e.g., generating human-like distributions of descriptions or coloring named parts on a blank tangram) as well as instruction-following tasks (e.g., placing pieces in the described configuration to reconstruct a tangram). More broadly, our data emphasizes the need for maintaining well-calibrated distributions over the many different possible ways that people may conceptualize or talk about things, rather than collapsing to a "best" prediction.

Limitations
Although randomly constructed reference games provide an interpretable evaluation metric, they also pose several limitations. Performance is limited by the fact that descriptions were elicited for isolated images. These descriptions do not reflect the kind of pragmatic reasoning commonly deployed by human speakers in reference games to resolve ambiguities (Goodman and Frank, 2016). In other words, annotators were not able to anticipate the necessary level of detail to disambiguate the object from a specific context of distractors, hence the descriptions may be underinformative. Randomly generated reference games may include ambiguities that make them impossible to solve (e.g., two objects that could both plausibly be described as a bird). The possible performance ceiling on these games is likely below 100%. Extending the data through interactive reference games is an important direction for future work. Likewise, our studies of baseline human performance on this task are preliminary. We found that participants clustered into higher- and lower-performing groups, likely reflecting attentional and motivational factors (e.g., some participants may not have fully attended to the provided part information). A better understanding of human behavior is critical before drawing any clear conclusions comparing human and model performance. Ultimately, models significantly outperformed mean human performance only after fine-tuning on approximately 6,600 example reference games.
Our resource contribution and analysis are focused on English. While the data collection design does not make language-specific assumptions, it depends on the availability of proficient speakers, which is limited in contemporary crowdsourcing services for certain languages. Our large collection of visual stimuli is well suited to extend our data collection to other languages and cultures, which may display different abstractions. This is an important direction for future work. Extending our analysis to other languages depends on the availability of pre-trained models in these languages, which may be limited by the availability of aligned language-vision data and the computational resources required for pre-training.

A Appendix
A.1 Examples from KILOGRAM

A.2 Collecting Tangrams
We scan all the pages of tangram solutions from Slocum (2003) into JPEG files to extract SVG files of individual tangrams. We use heuristics based on edge and corner detection (Harris et al., 1988) to extract individual tangrams into separate files by detecting the four corners of each puzzle and adding padding. We heuristically detect the individual standard pieces in each tangram using corner detection. Because the shapes are standard, we can test if an extracted shape is an expected puzzle piece and if we obtain the expected number of such shapes. We resize each tangram and all its pieces to a standard size, and label the ID of each puzzle piece consistently across all tangrams. We heuristically and manually validate the outputs, and prune solutions that fail to vectorize properly, for example if the process fails to recover exactly seven pieces.

A.3 Crowdsourcing Qualifications and Survey
The qualifier includes three multiple choice questions intended to ensure that (a) the annotator describes the abstract shape meaningfully instead of simply describing its geometry; (b) each part description only contains one part (body and arms instead of body with arms); and (c) the part descriptions correspond to the description of the whole shape. We provide a short video tutorial of the task and examples of invalid annotations for workers to view before completing the qualifier. We also collect basic non-identifying demographic data from each worker, including the languages that they speak and their proficiency, if English is their first language, and where they learned English. We retain the correspondence of anonymized hashed worker IDs to the annotations and language information they provide.

A.4 Dense Annotation Sampling
The set DENSE is made of 62 tangrams sampled from FULL and 12 tangrams commonly used in prior work. We sample the 62 tangrams from FULL to represent the diversity of tangrams using the first set of annotations we collect. We plot the annotated tangrams by PSA and by average log perplexity of whole-shape descriptions with $\frac{1}{100}$ smoothing, and apply a 5 × 5 grid to the plot (Figure A.3). Using perplexity and PSA allows us to sample a set of tangrams with diverse degrees of annotation and segmentation agreement. With a relatively high smoothing factor, we are able to spread out the data points, because the majority of the data set has high divergence in descriptions. We randomly pick 12 periphery points to collect more annotations for outliers, uniformly sample 25 from all the 1004 tangrams, and randomly sample 25, one from each grid cell, to represent the entire distribution.
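The grid-based part of this sampling can be sketched as follows. The text does not say how the grid cells are defined, so the sketch assumes equal-width cells spanning the observed range of each axis; the function name `sample_one_per_cell` and the fixed seed are likewise illustrative assumptions.

```python
import random

def sample_one_per_cell(points, grid=5, seed=0):
    """Pick one point index at random from every non-empty grid cell.

    points: list of (log_perplexity, psa) pairs, one per tangram.
    Returns a dict mapping (column, row) cell keys to the chosen index.
    """
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)

    def cell_index(value, lo, hi):
        if hi == lo:
            return 0
        # The maximum value is clamped into the last cell.
        return min(grid - 1, int((value - lo) / (hi - lo) * grid))

    cells = {}
    for idx, (x, y) in enumerate(points):
        key = (cell_index(x, x_min, x_max), cell_index(y, y_min, y_max))
        cells.setdefault(key, []).append(idx)
    return {key: rng.choice(members) for key, members in sorted(cells.items())}
```

With a 5 × 5 grid in which every cell is occupied, this returns 25 indices, one per cell, which matches the count of grid-sampled tangrams described above.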
We calculate the average log perplexity of whole-shape annotations for each tangram. Let $x^{(1)}, \dots, x^{(N)}$ be the annotations for a tangram, where each annotation is a sequence of tokens $x^{(j)} = \langle x_1, \dots, x_{M^{(j)}} \rangle$ of length $M^{(j)}$. We create a language model $p^{(j)}$ for every annotation $x^{(j)}$ using all other $N - 1$ annotations for the tangram:

$$p^{(j)}(x) = \frac{C_{x \in x^{(j' \neq j)}} + k}{\mathrm{total}_{j' \neq j} + kV},$$

where $C_{x \in x^{(j' \neq j)}}$ is the number of occurrences of $x$ in the other annotations for the tangram, $k$ is the smoothing factor, $\mathrm{total}_{j' \neq j}$ is the total number of words used in the other annotations for the tangram, and $V$ is the vocabulary size of all whole-shape annotations across all tangrams. The log perplexity for annotation $x^{(j)}$ is $\log PP^{(j)} = -\frac{1}{M^{(j)}} \sum_{i=1}^{M^{(j)}} \log p^{(j)}(x_i)$. The log perplexity for the tangram is the average of the log perplexity values for all its annotations, $\log PP = \frac{1}{N} \sum_{j=1}^{N} \log PP^{(j)}$. We lowercase, stem, and remove stop words before computing the log perplexity.
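The computation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the language model is read as an add-k smoothed unigram model, consistent with the counts, smoothing factor, and vocabulary size defined above; the default k = 0.01 reads the smoothing value as 1/100; tokens are assumed to be already lowercased, stemmed, and stripped of stop words; and the vocabulary size V is supplied by the caller. The function name is hypothetical.

```python
import math
from collections import Counter

def tangram_log_perplexity(annotations, vocab_size, k=0.01):
    """Average leave-one-out log perplexity over a tangram's annotations.

    annotations: list of token lists, one per whole-shape annotation.
    vocab_size:  V, the vocabulary size over all whole-shape annotations.
    k:           add-k smoothing factor (assumed to be 1/100).
    """
    per_annotation = []
    for j, tokens in enumerate(annotations):
        if not tokens:
            continue  # assumption: annotations left empty by preprocessing are skipped
        # Unigram counts over all annotations except the j-th.
        others = Counter(
            tok for jp, ann in enumerate(annotations) if jp != j for tok in ann
        )
        total = sum(others.values())
        log_prob = sum(
            math.log((others[tok] + k) / (total + k * vocab_size)) for tok in tokens
        )
        per_annotation.append(-log_prob / len(tokens))
    return sum(per_annotation) / len(per_annotation)
```

Tangrams whose annotators agree on wording receive a low value, while tangrams with divergent descriptions receive a high one, which is what makes the measure useful as one axis of the sampling plot.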

A.5 Example Inputs for Experimental Conditions
Figure A.4 shows how one annotation, including both text and image, appears under the different experimental conditions. For conditions with PARTS annotations, we generate simple English sentences combining the whole-shape description with part descriptions using the template <whole shape> with <part>, <part>, ..., and <part>. We add an indefinite article to each singular part description. BLACK images are tangrams with all pieces colored black with white borders. COLOR images are tangrams with each part colored with one of the CSS preset colors, in the order coral, gold, lightskyblue, lightpink, mediumseagreen, darkgrey, lightgrey, corresponding to the order of the parts in the annotation. For the augmented condition (AUG), text inputs are whole-shape annotations combined with each possible subset of the part descriptions. Image inputs are tangrams colored in the same way as COLOR images, but the parts excluded from the subset of part descriptions are colored black instead. All part descriptions in the annotations are randomly shuffled and not consistently associated with any particular color in the images, so that the coloring solely serves as an indication of the ordering of parts in the combined text.
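The text template and the augmented subsets can be sketched as below. Several details are assumptions made for illustration: plural part descriptions are detected with a crude "ends in s" heuristic, two parts are joined with a plain "and", the empty subset (whole-shape description only) is included among the augmented texts, and all function names are hypothetical.

```python
from itertools import combinations

def with_article(part):
    """Prefix a singular part description with 'a' or 'an'."""
    if part.endswith("s"):  # heuristic assumption: treat as plural
        return part
    article = "an" if part[0].lower() in "aeiou" else "a"
    return f"{article} {part}"

def join_parts(parts):
    """Join parts as '<part>, <part>, ..., and <part>'."""
    if len(parts) == 1:
        return parts[0]
    if len(parts) == 2:
        return f"{parts[0]} and {parts[1]}"
    return ", ".join(parts[:-1]) + ", and " + parts[-1]

def build_text(whole, parts):
    """Combine a whole-shape description with its part descriptions."""
    if not parts:
        return whole
    return f"{whole} with {join_parts([with_article(p) for p in parts])}"

def augmented_texts(whole, parts):
    """One text per subset of the parts, keeping the original part order."""
    texts = []
    for size in range(len(parts) + 1):
        for subset in combinations(parts, size):
            texts.append(build_text(whole, list(subset)))
    return texts
```

For example, `build_text("dog", ["head", "tail", "ears"])` yields "dog with a head, a tail, and ears". An annotation with n parts produces 2^n augmented texts under these assumptions, which is consistent with the much longer training time reported for the AUG setups.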

A.6 Human Performance Baseline Details
We recruited an independent group of 233 human participants from the Prolific crowdsourcing platform (https://www.prolific.co/), and asked them to perform the same reference game task we used for model evaluation. Each participant was randomly assigned to one of the four conditions and shown a random sequence of 20 trials from that condition. On each trial, we showed a text annotation from the development set along with the corresponding context of ten tangrams and asked the participant to click the tangram that was being described. The information that was available varied across conditions, just as in the model evaluations.
The tangrams were presented to participants either in black-and-white (BLACK) or colored according to their segmentation map (COLOR), and the language was either the whole-shape description alone (WHOLE) or with the parts included (PARTS). In the PARTS+COLOR condition, the parts text was colored to match the image to facilitate visual comparison, providing the same alignment information available to the models. We took several steps to ensure high-quality responses. First, participants began with a fixed set of 10 practice trials to familiarize themselves with the task. For these practice trials, we provided feedback indicating whether they had selected the correct tangram, and if not, we highlighted the correct answer. To assess whether participants were paying attention as opposed to responding randomly, we inserted an unambiguous "catch trial" where the target was the square tangram and the description was square. We excluded 16 participants who failed to select the correct target on this trial, yielding a final sample size of 217 participants out of the 233 recruited.
Because our aim was to obtain overall accuracy estimates for each condition, we did not require judgements for every individual annotation and context in the test set.However, we were able to ensure good coverage of the dataset, including annotations from all 125 tangrams and over 600 unique descriptions in each condition.

A.7 Model-specific Implementation Details
For experiments with CLIP, we use the ViT-B/32 variant. We fine-tune using an Adam optimizer with learning rate 5e-8 and weight decay 1e-6. At the end of each epoch, the training data is shuffled and rebatched. We train the models for up to 200 epochs and use a patience of 50 epochs to select the model with the highest image prediction accuracy on a non-augmented validation set taken from the training data. All images are resized to CLIP's default input resolution of 224 × 224, with white padding to make rectangular images square. The total number of trainable parameters in CLIP is 151.2M. CLIP models are fine-tuned with either a single GeForce RTX 2080 Ti GPU with 11 GB memory or a single Titan RTX GPU with 24 GB memory. Fine-tuning takes approximately 40 minutes per epoch for augmented setups (AUG) and roughly 3 minutes for other setups.
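The model selection criterion can be sketched as a patience-based loop. This is a minimal illustration, assuming that patience means stopping once validation accuracy has not improved for 50 consecutive epochs; to stay self-contained it takes a precomputed list of per-epoch validation accuracies instead of training a model, and the function name is hypothetical.

```python
def select_best_epoch(val_accuracies, max_epochs=200, patience=50):
    """Return (epoch, accuracy) of the best checkpoint under early stopping.

    val_accuracies: validation accuracy after each epoch, in order.
    Stops once `patience` epochs pass without a new best accuracy.
    """
    best_epoch, best_acc = None, float("-inf")
    for epoch, acc in enumerate(val_accuracies[:max_epochs]):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        elif best_epoch is not None and epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_acc
```

In an actual training run the checkpoint saved at the returned epoch would be the one evaluated; the same criterion applies to ViLT with a shorter horizon (30 epochs, patience of 10).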
For ViLT experiments, we fine-tune with an AdamW optimizer with learning rate 1e-4 and weight decay 1e-2. We use a cosine learning rate schedule with warm-up over the first epoch. We train the models for up to 30 epochs with a patience of 10 epochs and follow the same model selection criterion as for CLIP. All images are resized to 384 × 384. The total number of trainable parameters in ViLT is 87.4M. ViLT models are fine-tuned with a single Titan RTX GPU with 24 GB memory. Fine-tuning takes up to 5.5 hours per epoch for augmented setups (AUG) and roughly 15 minutes for other setups.
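One common form of such a schedule is sketched below. The text only states that the schedule is cosine with warm-up over the first epoch, so the remaining details are assumptions: the warm-up is linear, the rate is updated per optimization step, and the cosine phase decays to zero at the end of the last epoch. The function name and the `steps_per_epoch` parameter are hypothetical.

```python
import math

def lr_at_step(step, steps_per_epoch, num_epochs=30, base_lr=1e-4):
    """Learning rate at a 0-indexed optimization step.

    Linear warm-up over the first epoch, then cosine decay to zero.
    """
    warmup_steps = steps_per_epoch
    total_steps = steps_per_epoch * num_epochs
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

Under these assumptions the rate reaches the base value of 1e-4 at the end of the first epoch and then decreases smoothly over the remaining 29 epochs.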

A.8 Random Generation of Reference Games
In our main experiments (Section 5), we randomly generate reference games subject to constraints (Section 5.1). In particular, we ensure that distractors contain the same total number of parts. We explore the impact of these constraints by repeating our experiments on reference games generated without the constraints. Without the constraints, part counting can help the model disqualify distractors and significantly narrow down the set of likely referents. This is because images with a different

Figure 1: Two example tangrams, each with two different annotations. Each annotation includes a whole-shape description (bold), segmentation to parts (in color), and naming of parts (linked to each part). The top example shows low variability with near-perfect agreement, while the bottom shows high variability with divergence of language and segmentation.

Figure 2: The two phases of our annotation task.

Figure 3: Part distributions for different head words. Whole-shape head words (shown in descending order of frequency from left) elicit a variety of part head word distributions. Colors are randomly assigned to part head words, but are fixed across all bars. Grey indicates part head words with < 0.005 frequency.

Figure 5: SND, PND, and PSA correlations computed over the FULL set. Representative examples of different SND and PSA values are illustrated on the right. Densely annotated examples are highlighted in red.
Figure 6: Illustration of the language and vision modalities under the different experimental conditions.

Figure 8: Mean probability assigned to the correct image using fine-tuned CLIP (left) or fine-tuned ViLT (right) on the development set, by number of parts included in text and colored in the images. Curves are separated by total number of parts in the annotation of the target example. Error bands are bootstrapped 95% confidence intervals.

Figure A.1 shows example tangrams from our data. Figure A.2 shows examples of the use of the part name head, the most common part head word in the data. All data can be browsed on the data visualization dashboard: https://lil.nlp.cornell.edu/kilogram/.

Figure A.1: Example tangrams from our dataset.

Figure A.2: Example tangrams containing the part description head. Each example includes a tangram and its whole-shape description. We highlight the segmentation corresponding to head in each tangram.

Figure A.3: Sampled tangrams for dense annotation collection: 12 purple points picked from the periphery, 25 red points randomly sampled from each grid, and 25 green points uniformly sampled from all points.
Figure A.5: Mean development probabilities of predicting the correct image in reference games generated without constraints using fine-tuned CLIP (top) or fine-tuned ViLT (bottom) by number of parts included in text and colored in the images. We separate the curves by the total number of parts in the annotation of the target example. The error bands show the 95% confidence interval of the expected mean at each point by bootstrapping with 1000 resamplings.

Table 1: Data statistics for the complete dataset.

Table 2: Mean and standard deviation of our analysis measures on the three sets.

Table 1 shows basic data statistics. A total
shows several examples.