VisText: A Benchmark for Semantically Rich Chart Captioning

Captions that describe or explain charts help improve recall and comprehension of the depicted data and provide a more accessible medium for people with visual disabilities. However, current approaches for automatically generating such captions struggle to articulate the perceptual or cognitive features that are the hallmark of charts (e.g., complex trends and patterns). In response, we introduce VisText: a dataset of 12,441 pairs of charts and captions that describe the charts’ construction, report key statistics, and identify perceptual and cognitive phenomena. In VisText, a chart is available as three representations: a rasterized image, a backing data table, and a scene graph—a hierarchical representation of a chart’s visual elements akin to a web page’s Document Object Model (DOM). To evaluate the impact of VisText, we fine-tune state-of-the-art language models on our chart captioning task and apply prefix-tuning to produce captions that vary the semantic content they convey. Our models generate coherent, semantically rich captions and perform on par with state-of-the-art chart captioning models across machine translation and text generation metrics. Through qualitative analysis, we identify six broad categories of errors that our models make that can inform future work.


Introduction
Studies have shown that captions can improve the recall and comprehension of the data that charts depict (Hegarty and Just, 1993; Large et al., 1995). For instance, when a caption emphasizes visually prominent features of a chart, like a peak or a sharply declining trend, readers treat this information as the key takeaway (Kim et al., 2021). Moreover, for people with visual disabilities, captions (or equivalent descriptions such as alt text) are often the only means of accessing the presented data. However, as evidenced by numerous guidelines (Jung et al., 2021), producing high-quality chart captions is a non-trivial and laborious manual process. Thus, despite these advantages, charts are only rarely captioned in practice (Lundgard and Satyanarayan, 2022).
* Both authors contributed equally to this work.
To bridge this gap, several research communities have begun to explore methods for automatically generating chart captions, including using templates and heuristics (Demir et al., 2008; Srinivasan et al., 2019), adapting image captioning techniques (Balaji et al., 2018; Chen et al., 2019a), or via data-to-text machine translation (Kantharaj et al., 2022; Obeid and Hoque, 2020). While promising, these approaches have largely produced captions that either describe a chart's construction (e.g., "The graph is plot between 'Number of people' x-axis over 'Movie Genres' y-axis" (Balaji et al., 2018)) or provide statistical summaries (e.g., "Machinery and equipment was the most valuable commodity for Singapore in 2019" (Kantharaj et al., 2022)). However, these captions do not articulate the perceptual and cognitive features that make charts a distinctive and compelling medium for communicating data (e.g., "Prices of Big Tech corporations seem to fluctuate but nevertheless increase over time" (Lundgard and Satyanarayan, 2022)). Indeed, as Lundgard and Satyanarayan (2022) find, both sighted and blind readers strongly prefer captions that express this type of content.
To automatically produce such semantically richer captions, we introduce VisText: a benchmark dataset of 12,441 pairs of charts and captions. VisText makes two key extensions over prior approaches. First, VisText offers three representations of charts: a rasterized image and backing data table, as in previous work; and a scene graph, a hierarchical representation akin to a web page's Document Object Model (DOM), that presents an attractive midpoint between the affordances of chart-as-image and chart-as-data-table. Second, for each chart, VisText provides a synthetically generated caption detailing its construction as well as a crowdsourced caption describing its statistical, perceptual, and cognitive features. These crowdsourced captions represent a substantial increase in data over prior comparable datasets (Mahinpei et al., 2022; Kantharaj et al., 2022).
To demonstrate the possible uses of the VisText dataset, we train three classes of models: text-based captioning models, image-guided captioning models, and models with semantic prefix-tuning. Text-based captioning models fine-tune large language models for VisText's chart captioning task, revealing that both data table and scene graph representations can produce compelling and semantically rich captions. Following recent advancements in image-guided translation (Sulubacak et al., 2020), we leverage the additional visual affordances in chart images to develop image-guided chart captioning models. Finally, since users often have varying preferences about the type of semantic content in their captions (Lundgard and Satyanarayan, 2022), we apply semantic prefix-tuning to each of our models, enabling them to output customizable captions.
Our models generate coherent, semantically rich captions across the VisText charts. Evaluating against standard machine translation and text generation metrics reveals that our models consistently output captions that accurately describe the chart's construction, such as its chart type, title, and axis ranges. Through qualitative analysis of our model's captions, we find that our model competently outputs semantically rich captions that describe data trends and complex patterns. Further, we categorize six common captioning errors that can inform the future development of chart captioning models on the VisText dataset.

Related work
Heuristic-Based Chart Captioning. Automatically generating natural language descriptions of data tables dates back to Reiter and Dale (1997). Demir et al. (2008, 2010, 2012) survey this early work and describe the process of extracting insights from a chart by evaluating a list of propositions and composing selected propositions together to produce a natural language summary. More recently, data visualization researchers have explored heuristics that calculate summary statistics and templates to assemble natural language "data facts" (Srinivasan et al., 2019) or descriptions (Cui et al., 2019). While useful, these approaches yield standardized descriptions that lack the variation and linguistic construction that characterize semantically rich captions (Lundgard and Satyanarayan, 2022).
Chart Captioning as Image Captioning. With rapid advances in neural image captioning (Vinyals et al., 2015; Anderson et al., 2018), researchers have begun to adapt these methods for captioning charts. For instance, Balaji et al. (2018) develop a deep learning pipeline that ingests a PNG chart image, classifies the chart type, detects and classifies textual content present in the chart, and uses this information to generate a textual description. Chen et al. (2019a,b, 2020) propose a simpler workflow using ResNet to encode the chart image and an LSTM with Attention to decode it into a natural language description. Both approaches share a pair of limitations. The captions they produce convey relatively simplistic information about the chart (e.g., title, axis labels, etc.) or articulate concepts in visual rather than data terms (e.g., "Dark Magenta has the lowest value"). While both approaches contribute associated datasets, their charts and captions are synthetically generated and may not represent real-world counterparts. SciCap (Hsu et al., 2021) addresses this limitation by scraping real-world charts from 290,000 arXiv papers; however, the baseline models trained on this dataset struggle to generate semantically rich captions.
Chart Captioning as Text Translation. Perhaps closest to our contribution is recent work modeling chart captioning as a data-to-text problem. For instance, Spreafico and Carenini (2020) train an encoder-decoder LSTM architecture to generate a natural language caption from time series data. Similarly, Obeid and Hoque (2020) and Kantharaj et al. (2022) explore how transformer architectures can translate tabular structures into captions. These data-to-text methods are more successful than chart-as-image captioning, yielding captions that better capture relevant information from the charts and have higher BLEU scores. Nevertheless, we observe two limitations with these data-to-text approaches that motivate our contribution. First, data-to-text methods are heavily reliant on access to a chart's data table. In practice, data tables are only rarely published alongside charts, and methods that recover equivalent information via OCR experience a significant drop in performance (Kantharaj et al., 2022). Second, the associated datasets do not contain sufficient training examples of captions that express semantically rich insights about the depicted data (i.e., the perceptual and cognitive phenomena that distinguish charts as a medium from data tables (Lundgard and Satyanarayan, 2022)). As a result, while the generated captions are compelling, they are largely limited to reporting statistics, which sighted and blind readers prefer less than captions that convey complex trends and patterns (Lundgard and Satyanarayan, 2022).

The VisText Dataset
We designed the VisText dataset in response to two limitations existing datasets present for generating semantically rich chart captions. First, existing datasets represent charts as either rasterized images or as data tables. While useful, these representations trade off perceptual fidelity and chart semantics in mutually exclusive ways: images capture the perceptual and cognitive phenomena that are distinctive to charts (e.g., trends or outliers), but pixels cannot express the rich semantic relationships between chart elements (e.g., estimating plotted data values using axis labels). While the reverse is true of data tables (Lundgard and Satyanarayan, 2022), tables also present additional caveats. There is not always a one-to-one relationship between the semantics of a data table and chart (i.e., one data table may be the source for several distinctly different charts). Moreover, data tables are rarely published alongside charts, and automatic data table extraction is error-prone due to the diversity of chart types and visual styles as well as the difficulty of reasoning about visual occlusion (Kantharaj et al., 2022; Luo et al., 2021; Jung et al., 2017).
Second, if existing datasets provide captions that describe perceptual or cognitive features, these captions comprise only a small portion of the dataset. At best, LineCap (Mahinpei et al., 2022) offers 3,528 such captions for line charts only, while Chart-to-Text (Kantharaj et al., 2022) estimates that roughly 15% of the sentences in its captions across a variety of chart types express such content.
In contrast, VisText provides 12,441 crowdsourced English captions that articulate statistical, perceptual, and cognitive characteristics of bar, line, and area charts. In VisText, charts are available as not only data tables and rasterized images but also as scene graphs. Scene graphs are hierarchical representations that better preserve perceptual fidelity and chart semantics, are often the format for publishing web-based charts, and can be recovered from chart images (Poco and Heer, 2017).

Data Table Collection
The data tables found in VisText are sourced from the Statista dataset of the Chart-to-Text benchmark (Kantharaj et al., 2022). The tables were collected by crawling Statista.com in December 2020 and contain real-world data related to technology, trade, retail, and sports. We process these tables to make them amenable for chart generation, including stripping formatting symbols (e.g., $ and %), standardizing data strings, and identifying the measure type of each column (i.e., quantitative, categorical, or temporal). Data tables are discarded if they do not contain at least one quantitative field and one categorical or temporal field, or if other errors occur during the processing steps. We further down-select to data tables containing between 2 and 20 columns and between 10 and 500 rows. If a data table has over 500 rows, we randomly sample rows. In larger data tables, this step potentially affects how salient a trend is.
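The selection criteria above can be sketched as a small filter. This is a minimal illustration rather than the processing pipeline itself: the function name `select_table`, the list-of-dicts table layout, the `measure_types` mapping, the fixed seed, and the choice to sample oversized tables down to exactly 500 rows in their original order are all assumptions. Only the thresholds come from the text.

```python
import random

MIN_COLS, MAX_COLS = 2, 20      # column bounds stated in the text
MIN_ROWS, MAX_ROWS = 10, 500    # row bounds stated in the text


def select_table(rows, measure_types, seed=0):
    """Return a usable list of rows, or None if the table is discarded.

    rows: list of dicts, one per table row (hypothetical layout).
    measure_types: dict mapping column name to "quantitative",
        "categorical", or "temporal" (assumed to be computed upstream).
    """
    types = set(measure_types.values())
    # Require one quantitative field plus one categorical or temporal field.
    if "quantitative" not in types or not types & {"categorical", "temporal"}:
        return None
    if not MIN_COLS <= len(measure_types) <= MAX_COLS:
        return None
    if len(rows) < MIN_ROWS:
        return None
    if len(rows) > MAX_ROWS:
        # Assumption: sample down to MAX_ROWS and keep the original row order.
        rng = random.Random(seed)
        keep = sorted(rng.sample(range(len(rows)), MAX_ROWS))
        rows = [rows[i] for i in keep]
    return rows
```

Because sampling removes rows at random, a sharp but short-lived peak in a long table may disappear, which is one way the salience of a trend can change.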

Chart Generation and Representation
Charts in the Chart-to-Text Statista dataset all feature the same layout and visual appearance. In contrast, we aim for richer visual diversity by generating charts using the Vega-Lite visualization library (Satyanarayan et al., 2016) via the Python Altair package (VanderPlas et al., 2018). To facilitate collecting high-quality captions, we focus on univariate charts: charts that depict one quantitative observation against a categorical or temporal variable. This focus is informed by recent work in the data visualization research community which has chosen single-series line charts as the target of study for natural language descriptions (Kim et al., 2021; Stokes et al., 2022). VisText also includes single-series bar and area charts as they typically exhibit similar perceptual features to line charts.
For each data table, we iterate through pairs of univariate fields. If the pair contains a temporal field, we randomly generate an area or line chart; if the pair contains a categorical field, we randomly generate a horizontal or vertical bar chart. For diversity in layout and visual appearance, we randomly rotate axis labels and apply one of fourteen themes provided by the Vega-Lite library. These themes mimic the visual style of common chart platforms or publishers (e.g., ggplot2 or the LA Times).
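The chart-selection rule above can be sketched in plain Python. This is an illustration under stated assumptions, not the generation code: it interprets a "pair" as one quantitative field combined with one temporal or categorical field, and the names `chart_specs`, `theme_index`, and `rotate_labels` are hypothetical. Rendering with Altair or Vega-Lite is deliberately left out.

```python
import random
from itertools import product

N_THEMES = 14  # number of Vega-Lite themes mentioned in the text


def chart_specs(measure_types, seed=0):
    """Build one abstract chart description per (measure, dimension) pair."""
    rng = random.Random(seed)
    measures = [c for c, t in measure_types.items() if t == "quantitative"]
    dimensions = [(c, t) for c, t in measure_types.items()
                  if t in ("temporal", "categorical")]
    specs = []
    for measure, (dimension, dim_type) in product(measures, dimensions):
        if dim_type == "temporal":
            chart = rng.choice(["area", "line"])
        else:
            chart = rng.choice(["horizontal bar", "vertical bar"])
        specs.append({
            "chart": chart,
            "dimension": dimension,
            "measure": measure,
            # Hypothetical stand-ins for the random theme and label rotation.
            "theme_index": rng.randrange(N_THEMES),
            "rotate_labels": rng.random() < 0.5,
        })
    return specs
```

Under this rule every categorical pair becomes a bar chart, while temporal pairs are divided between area and line charts.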

Figure 1: The VisText dataset consists of 12,441 charts represented as a rasterized image, data table, and scene graph. Before model training, each data table and scene graph is processed from its original form (shown) to a minimized and linearized text representation. Each chart is accompanied by a generated L1 caption describing the aspects of the chart's construction (e.g., chart type and axis labels) and a crowdsourced L2/L3 caption describing summary statistics and interesting trends (Lundgard and Satyanarayan, 2022).
Data table excerpt (columns: Month, Patients within Japan): Feb 11, 2020, 16; ...
Generated L1 caption: "Here is a area chart is labeled Cumulative number of patients diagnosed with coronavirus in Japan as of December 4, 2020, by place of infection. On the x-axis, Month is measured with a categorical scale starting with April and ending with October. There is a linear scale with a minimum of 0 and a maximum of 150,000 along the y-axis, labeled Patients within Japan."
Crowdsourced L2/L3 caption: "By December 4th 2020, approximately 160,000 people in Japan had been diagnosed with COVID-19. The first person diagnosed with COVID-19 in Japan was diagnosed in March 2020. The greatest increase in cumulative number of patients in Japan diagnosed with COVID-19 occurred between November and December 2020."
In VisText, each chart is represented as a rasterized image, stored as an RGBA-encoded PNG file, as well as a scene graph. A scene graph is a textual representation of the rendered chart similar to a web page's Document Object Model (DOM). Scene graphs encode the position, value or content, and semantic role of all visual elements within a chart, including the individual marks (i.e., bars or points along the line), titles, axes, gridlines, etc. Thus, scene graphs express the perceptual features of rasterized images in a more computationally tractable form.
Scene graphs are a standard data structure for representing vector-based graphics, the most common format for publishing visualizations, and thus can be trivially recovered (e.g., by traversing the SVG text string). We extract the scene graph directly from the rendered chart using the Vega-Lite API. As most text generation models expect a linear set of input tokens, we flatten the scene graph via a depth-first traversal. To scale to large language models, we need to further reduce the size of the scene graph. Thus, we preserve the following elements which we hypothesize as being most critical for generating semantically rich captions: title, title coordinates, axis labels, axis label coordinates, axis tick coordinates, mark coordinates, and mark sizes. VisText includes both the original (hierarchical) and reduced (linearized) scene graphs.
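The flattening and reduction steps can be sketched as a pre-order depth-first traversal over a nested structure. The node layout used here (dicts with `role`, `text`, `x`, `y`, `width`, `height`, and `children` keys) and the role names are hypothetical stand-ins, not the Vega-Lite scene graph schema or the exact linearized format in VisText.

```python
# Hypothetical role names for the elements the text says are preserved.
KEEP_ROLES = {"title", "axis-label", "axis-tick", "mark"}


def linearize(node, keep_roles=KEEP_ROLES):
    """Flatten a nested scene graph into a list of strings, depth first.

    Nodes whose role is not in keep_roles are skipped, but their children
    are still visited so that nested marks and labels are not lost.
    """
    tokens = []
    if node.get("role") in keep_roles:
        parts = [node["role"]]
        for key in ("text", "x", "y", "width", "height"):
            if key in node:
                parts.append(f"{key} {node[key]}")
        tokens.append(" ".join(parts))
    for child in node.get("children", []):
        tokens.extend(linearize(child, keep_roles))
    return tokens


scene = {"role": "root", "children": [
    {"role": "title", "text": "Sales", "x": 100, "y": 10},
    {"role": "axis", "children": [
        {"role": "axis-label", "text": "Year", "x": 100, "y": 220},
        {"role": "gridline", "x": 0, "y": 50},
    ]},
    {"role": "mark", "x": 20, "y": 180, "width": 10, "height": 40},
]}
flat = " ".join(linearize(scene))
```

In this toy example the gridline is dropped while the title, axis label, and mark survive, and the joined string is what a text model would receive.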

Caption Generation and Collection
Our captioning process is guided by the framework developed by Lundgard and Satyanarayan (2022), which identifies four levels of semantic content: L1 content enumerates aspects of the chart's construction (e.g., axis ranges); L2 content reports summary statistics and relations (e.g., extrema); L3 content synthesizes perceptual and cognitive phenomena (e.g., complex trends); and, L4 content describes domain-specific insights (e.g., sociopolitical context). In subsequent studies, the authors find that while sighted readers typically prefer higher levels of semantic content, blind readers are split about the usefulness of L1 and L4 content. Thus, given these differing preferences, we define a single caption to express multiple levels of content separated across clauses or sentences. We only consider the first three levels of this model, and leave L4 content to future work. Following guidelines prescribed by the National Center for Accessible Media (NCAM), our captions begin with L1 content and then turn to L2 and L3 content (Gould et al., 2008).
We algorithmically generate L1 content and use a crowdsourced protocol to collect L2 and L3 content. This approach follows the computational considerations of Lundgard and Satyanarayan (2022) as well as results from Morash et al. (2015), who find that, even with instructions and guidelines, crowd workers do not describe a chart's structural elements sufficiently for blind readers. Thus, synthetically generating L1 content allows us to ensure that captions convey complete descriptions of the chart's structural elements. L1 content comprises 1 sentence conveying the chart type and title, and then 1-2 sentences describing the axes (including the titles, ranges, and scales). We use template randomization to generate a diverse range of L1 captions to mimic human variability and reduce the capacity of the model to overfit to a single L1 style. Three templates are defined for the first sentence and twenty-six template combinations for the subsequent sentences. During generation, we randomly select a pair of templates and fill in information from the abstract chart specification. For additional diversity, we randomly drop scale information and swap template words with synonyms. Templates and synonym replacements are listed in Appendix E.2.
Figure 2: The VisText dataset contains data table and scene graph representations of each chart paired with L1 and L2/L3 captions. The distributions and means (dotted lines) of representations (left pair) and captions (right pair) are shown. As the distribution of chart representations has a long tail, we split it into two charts at 2,500 characters to better display the tail by re-scaling the y-axis of the second chart.
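The template randomization used for L1 captions can be sketched as follows. The templates, synonym lists, and the 50% probability of dropping scale information are invented for illustration. The actual templates are the ones listed in Appendix E.2, and they are not reproduced here.

```python
import random

# Hypothetical templates; the real ones are listed in Appendix E.2.
FIRST_SENTENCE = [
    "This is a {chart} titled {title}.",
    "Here is a {chart} named {title}.",
    "{title} is a {chart}.",
]
AXIS_SENTENCES = [
    "The x-axis shows {x} and the y-axis shows {y}.",
    "{x} is plotted on the x-axis, with {y} on the y-axis.",
]
SCALE_SENTENCE = "The y-axis uses a {scale} scale."
SYNONYMS = {"shows": ["shows", "measures", "plots"]}


def generate_l1(spec, rng):
    """Pick random templates, optionally drop scale info, swap synonyms."""
    template = rng.choice(FIRST_SENTENCE) + " " + rng.choice(AXIS_SENTENCES)
    if rng.random() < 0.5:  # assumed drop probability
        template += " " + SCALE_SENTENCE
    # Swap synonyms in the template before filling it in, so that chart
    # titles and axis names are never rewritten.
    for word, options in SYNONYMS.items():
        template = template.replace(word, rng.choice(options))
    return template.format(**spec)


spec = {"chart": "line chart", "title": "Sales", "x": "Year",
        "y": "Revenue", "scale": "linear"}
caption = generate_l1(spec, random.Random(0))
```

Applying synonym replacement to the template rather than to the filled caption is a design choice of this sketch; it avoids altering words that happen to appear in titles or axis names.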
To crowdsource L2 and L3 content, we extend the protocol used by Lundgard and Satyanarayan (2022). After soliciting consent, we introduce the task: participants are presented with a chart image and corresponding L1 description; they are asked to write a description about the trends and patterns they observe without drawing on background knowledge or repeating L1 information. The introduction provides examples and explanations of valid and invalid responses. After acknowledging these examples, participants are asked to complete 5 random iterations of the task. To maximize the quality of our crowdsourced captions, we manually curated the charts and L1 descriptions used in the study. We discarded any charts that were challenging to read (e.g., colors were too similar, marks were not easily readable, etc.). Participants were recruited on the Prolific.co platform, took approximately 14 minutes to complete the study, and were compensated $3.25 ($14/hour). Additional details on our crowdsourcing process are in Appendix E.3.
We manually verified charts where participants failed an attention check and discarded invalid descriptions. Additionally, we manually inspected captions for personally identifiable information or offensive content. Using heuristics, we removed captions where respondents described charts as unclear or illegible and replaced newline characters with spaces. Although we attempted to fix incorrect spelling and casing errors using a similar heuristic-based approach, we observed that this process could improperly affect axis and chart names. As a result, these errors remain in our dataset. Figure 2 shows the distribution and means of the lengths of chart representations and captions. Synthetically generated L1 captions have roughly 1.5x more characters than crowdsourced L2/L3 captions (µ = 255 vs. µ = 177) but the average number of sentences is comparable (2.5 vs. 2). The VisText dataset consists of captions for 3,189 area charts, 6,238 bar charts, and 3,014 line charts. There are roughly twice as many bar charts as area or line charts because temporal fields are randomly assigned to either an area or a line chart during chart generation (Sec. 3.2). As some charts have multiple crowdsourced captions, we randomly split our dataset into training, validation, and test sets using the chart IDs to prevent data leakage across sets. This resulted in an approximate ratio of 80:10:10.
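Splitting on chart IDs rather than on individual captions can be sketched as below. The function name, the seed, and the integer rounding are assumptions. The point is only that every caption of a given chart lands in the same split.

```python
import random


def split_by_chart(chart_ids, seed=0, train_pct=80, val_pct=10):
    """Assign each unique chart ID to 'train', 'val', or 'test'.

    chart_ids may contain repeats (one entry per caption); all captions
    that share an ID receive the same split, which prevents leakage.
    """
    unique = sorted(set(chart_ids))
    random.Random(seed).shuffle(unique)
    n_train = len(unique) * train_pct // 100
    n_val = len(unique) * val_pct // 100
    groups = {
        "train": unique[:n_train],
        "val": unique[n_train:n_train + n_val],
        "test": unique[n_train + n_val:],
    }
    return {cid: name for name, ids in groups.items() for cid in ids}


# 250 captions spread over 100 charts (toy data).
caption_chart_ids = [f"chart_{i % 100}" for i in range(250)]
assignment = split_by_chart(caption_chart_ids)
caption_splits = [assignment[cid] for cid in caption_chart_ids]
```

Because charts carry different numbers of captions, the caption-level ratio is only approximately 80:10:10 even when the chart-level ratio is exact.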

Dataset Analysis
Finally, to understand the distribution of semantic content, we manually coded 2% (230) of crowdsourced captions. We followed a protocol inspired by Lundgard and Satyanarayan (2022) by breaking sentences down into independent statements and mapping these statements to their semantic content level. We marked statements as not categorizable if they did not map to the framework (for instance, if captions expressed commentary from crowd workers such as "this chart is hard to read"). Our analysis revealed 11 L1 statements (2.4%), 180 L2 statements (39.7%), 253 L3 statements (55.7%), and 10 not categorizable statements (2.2%). While a handful express L1 content, the bulk of statements (95%) express L2 or L3 content, with approximately 1.4x as many L3 statements as L2 statements.

Chart Captioning Models
To demonstrate the affordances of the VisText dataset, we train three classes of models. First, we fine-tune large language models to translate from textual chart representations to natural language captions. These models evaluate the feasibility and impact of scene-graph models compared to prior data-table approaches (Kantharaj et al., 2022). Second, as VisText provides multiple chart representations, we adapt image-guided translation (Sulubacak et al., 2020; Cho et al., 2021) to develop two multimodal chart captioning models: image-scene-graph and image-data-table. Finally, since VisText offers captions at different semantic levels and prior work has shown significant differences in readers' preferences (Lundgard and Satyanarayan, 2022), we explore prefix-tuned models that selectively output L1, L2/L3, or L1+L2/L3 captions. Training details are in Appendix D.

Text-Based Chart Captioning
Informed by prior work (Kantharaj et al., 2022), we investigate text translation models for generating chart captions. In particular, Kantharaj et al. found that models that translate data tables to chart captions significantly outperform image captioning models. However, when data tables were not available, the authors found a significant drop in their models' ability to extract relevant information from the chart, an effect that was only slightly ameliorated by using OCR methods to extract text from chart images. In contrast, VisText's scene graphs can be more readily recovered from charts when data tables are not available (for instance, by processing the SVG format of web-based visualizations). Moreover, scene graphs offer a potentially richer source of information than data tables as they encode visual properties of the chart (e.g., coordinates and colors) and are less noisy than tokens recovered via OCR. Thus, to evaluate the feasibility and efficacy of scene graphs, we train a scene-graph text translation model and a baseline data-table model for comparison.
For each model, we fine-tune a pretrained ByT5 transformer model (Xue et al., 2022) on the VisText dataset. We choose ByT5 over T5 transformers (Raffel et al., 2020) because it uses a token-free byte encoding that eliminates the use of a tokenizer. As a result, it is robust to noisy inputs, minimizes the need for text preprocessing, and eliminates the out-of-dictionary problem. This allows our model to handle common typographical and chart reading errors in the crowdsourced L2 and L3 captions and increases generalizability to previously unseen words that could be present in chart and axes titles.
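The idea behind a byte-level encoding can be illustrated without any model code. The sketch below maps text to one integer per UTF-8 byte; the `offset` that reserves a few low ids for special tokens is an assumption made for illustration, and the functions are not the ByT5 implementation.

```python
def byte_encode(text, offset=3):
    """Map text to integer ids, one per UTF-8 byte.

    offset reserves the lowest ids for special tokens (assumed value).
    """
    return [b + offset for b in text.encode("utf-8")]


def byte_decode(ids, offset=3):
    """Invert byte_encode; ids below the offset are treated as special."""
    raw = bytes(i - offset for i in ids if i >= offset)
    return raw.decode("utf-8", errors="ignore")


# A misspelled axis title with a non-ASCII symbol still encodes cleanly.
noisy_title = "Reveneu (€)"
ids = byte_encode(noisy_title)
```

Every possible string maps to ids drawn from a fixed range of 256 byte values plus the offset, so a misspelling or an unseen word in a chart title cannot fall outside the vocabulary. The trade-off is longer input sequences than with subword tokenization.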

Image-Guided Chart Captioning
Following recent advancements in image-guided machine translation (Sulubacak et al., 2020), we train image-guided captioning models using the VisText dataset. Images have improved text-based machine translation models by providing visual information complementary to natural language inputs. Similarly, chart images can contain visuals complementary to the textual specification. For instance, visual affordances that are important for perceiving a trend (e.g., gestalt relations, relative sizes/areas, etc.) may be obfuscated in the scene graph but better captured in the chart image.
We train three image-guided chart captioning models: image, image-scene-graph, and image-data-table. All models leverage the vision-language transformer model VL-T5 (Cho et al., 2021). VL-T5 is pretrained on image captioning and visual grounding tasks and was successfully applied to machine translation, making it suitable for chart captioning. We extract visual features for each VisText chart image using a Bottom-Up Feature Extractor (Anderson et al., 2018). To explore the value of images to chart captioning, our image model only takes in the image features, while image-scene-graph and image-data-table concatenate the image features with the chart's textual representations (scene graph or data table).

Semantic Prefix-Tuning
In real-world chart captioning settings, users want to vary the level of semantic content in their captions. For instance, while some blind users want verbose captions that describe the chart visuals, sighted users may only want captions that help them expose data trends (Lundgard and Satyanarayan, 2022). To develop models capable of such customization, we leverage prefix-tuning strategies alongside VisText's semantic caption breakdown. Prefix-tuning specifies a task alongside the input, permitting a single large language model to perform many different tasks. In our setting, we use prefix-tuning to specify the level of semantic content to include in the caption (Li and Liang, 2021).
We train each of our models with and without semantic prefix-tuning. With semantic prefix-tuning, we treat chart captioning as a multi-task fine-tuning problem, where the model is trained to generate the L1 and L2/L3 captions separately. In every epoch, the model sees each VisText chart twice, once with the L1 prefix and caption and once with the L2/L3 prefix and caption.
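The multi-task setup can be sketched as a function that turns one chart into two training pairs. The prefix strings and field names below are hypothetical placeholders, not the strings used to train the VisText models.

```python
# Hypothetical prefix strings that select the semantic level of the output.
L1_PREFIX = "translate chart to L1: "
L2L3_PREFIX = "translate chart to L2L3: "


def prefix_examples(chart_text, l1_caption, l2l3_caption):
    """Return two (input, target) pairs for one chart, one per prefix."""
    return [
        (L1_PREFIX + chart_text, l1_caption),
        (L2L3_PREFIX + chart_text, l2l3_caption),
    ]


def build_epoch(charts):
    """Each chart appears twice per epoch, once with each prefix."""
    examples = []
    for chart in charts:
        examples.extend(
            prefix_examples(chart["text"], chart["l1"], chart["l2l3"]))
    return examples


charts = [
    {"text": "title Sales x Year y Revenue",
     "l1": "This is a line chart titled Sales.",
     "l2l3": "Revenue rises steadily over time."},
]
epoch = build_epoch(charts)
```

At inference time the same chart text can be paired with either prefix to request the corresponding caption level.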

Evaluation and Results
To evaluate the VisText dataset and our chart captioning models, we measure the readability and accuracy of generated captions and their similarity to the VisText target caption. We also qualitatively analyze the descriptiveness of generated L2/L3 captions and categorize common errors.

Quantitative Model Performance
We evaluate the results of our text-based and image-guided captioning models with and without prefix-tuning. We also compare to a current state-of-the-art chart captioning model that uses data table chart representations and a T5 generation model (Kantharaj et al., 2022). To measure the quality of output captions, we evaluate each model on machine translation and language generation metrics (Table 1).
Chart images do not support captioning. The image model performs the worst of all the chart captioning models. Its low perplexity and high error rates indicate it is highly confident in its inaccurate captions. While chart images contain the same information encoded in the chart's textual representations, it is presumably not adequately extracted by the model. Both the image model backbone (Cho et al., 2021) and the visual feature extractor (Anderson et al., 2018) are trained on natural images, making chart images out-of-distribution inputs that are likely to be poorly represented by these vision models. As the chart captioning task grows, model backbones, architectures, and feature extractors could be customized to chart images, which may improve image-based chart captioning.
All models produce high quality L1 captions. In our chart captioning setting, relation generation (Wiseman et al., 2017) measures how often the chart title, axis names, and axis scales in the input appear in the caption. Every model (except image) achieves a similarly-high relation generation score, indicating that every model can generate detailed L1 captions.
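A simplified version of this measure can be sketched as the fraction of input fields that appear in the caption. Case-insensitive substring matching is an assumption of this sketch. The exact matching procedure is not specified in this section, and the function name is hypothetical.

```python
def relation_generation(caption, fields):
    """Fraction of input fields found verbatim (ignoring case) in a caption.

    fields: strings taken from the chart input, such as the title,
        axis names, and axis scale types.
    """
    if not fields:
        return 0.0
    text = caption.lower()
    hits = sum(1 for value in fields if value.lower() in text)
    return hits / len(fields)


fields = ["Sales by year", "Year", "Revenue", "linear"]
good = ("This line chart is titled Sales by Year. Year is on the x-axis "
        "and Revenue is on a linear y-axis.")
poor = "A chart about Revenue."
scores = (relation_generation(good, fields), relation_generation(poor, fields))
```

Substring matching can over-count short fields that occur inside longer words, so a stricter implementation might match on word boundaries.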
Scene graphs perform as well as data tables. Models trained on scene graph representations achieve similar performance across the evaluative metrics to models trained on data tables. As scene graphs can be more easily extracted from web-based chart images, they may be the preferred representation for future chart captioning models.

Image-guiding does not improve captioning. Our image-guided captioning models do not experience the significant increase in performance other image-guided translation tasks report. While in image-guided translation, images contain substantial additional information beyond the text, the image and textual representations in chart captioning often contain highly similar information. The small amount of additional information in images might benefit complex captioning tasks on multivariate charts or infographics; however, the current VisText captions rarely reference visual information not present in the scene graph or data table.
Prefix-tuning is free. Adding semantic prefix-tuning to our models does not significantly change their performance. Models trained with and without prefix-tuning are exposed to the same set of charts, so it is consistent that prefix-tuning would not impact the quality of output captions. Given prefix-tuned models are able to output L1, L2/L3, and L1+L2/L3 captions, prefix-tuning may be preferred if users require semantic customization.

Qualitative Caption Evaluation
To augment our quantitative evaluation, we qualitatively assess the descriptiveness and accuracy of the generated chart captions. Since L1 caption accuracy can be measured at scale via relation generation, we focus our evaluation on L2/L3 predictions.
Prior analysis tasked annotators with comparing the accuracy, coherence, and fluency of generated captions compared to a target caption (Kantharaj et al., 2022). Instead, our approach follows an inductive qualitative data analysis approach: iteratively analyzing captions in a "bottom-up" fashion to identify emergent patterns in how generated captions compare to the ground truth (Bingham and Witkowsky, 2021). We randomly sample 176 generated captions from the scene-graph model with prefix-tuning and break them into their independent L2 and L3 statements, resulting in 181 (48.27%) L2 statements and 194 (51.73%) L3 statements.
two L2 statements. Often this takes the form of an L3 general trend statement (e.g., "The median annual family income in Canada has increased from 2000 to 2018") accompanied by an L2 minimum and maximum statement ("The highest was in 2015 at 80k and the lowest was in 2000"). For the remaining half of analyzed captions, we identified the following recurring types of errors:
Identity Errors. We identify 86 identity errors (22.93% of analyzed statements). An identity error occurs when an L2 or L3 statement incorrectly reports the independent variable for a given (often correctly identified) trend. For bar charts, this error means incorrectly reporting the categorical label associated with a bar (e.g., in Appendix Figure 5c: "The most popular music activity is vinyl albums and vinyl singles" should be "The most popular music activity is tickets for festivals"). For area and line charts, this error means incorrectly identifying the temporal point or range of the trend. With bar charts, in particular, we observed that the identities were often "off-by-one" (i.e., identifying a minimum or maximum value, but attributing it to the second-highest or second-lowest category).
Value Errors. A value error occurs when the quantitative data value of a statement is incorrect.
Of the captions we analyzed, 3.20% (12) of statements contained a value error. For instance, as shown in Appendix Figure 4c, for the caption "The total gate revenue from sporting events worldwide by region from 2006 to 2015 has increased from around 15 billion dollars to around 15 billion dollars", the value should be around 18 billion dollars. If it is ambiguous whether an error is an Identity or Value Error, we classify it as the former.
Direction Errors. A direction error occurs when the direction (which can be increasing, decreasing, or stable) of a trend in an L3 statement is incorrect. We uncovered 32 direction errors (8.53% of analyzed statements). For instance, in the caption "The per capita consumption of sweet corn in the US has increased from 2000 to 2019" (Appendix Figure 3c), the trend is actually decreasing. In most direction errors, the identity (i.e., temporal range) is correct.
Stability Errors. A stability error occurs when the magnitude of a direction or the variance in a trend is incorrect. This can often refer to how much a trend is increasing or decreasing, such as rapidly or slowly, as well as whether it's a steady change or highly-fluctuating change. In Appendix Figure 4b, "The comparable sales growth of Sam's Club in the United States from fiscal year 2006 to 2020 has been steadily decreasing from 2006 to 2020." should read "The comparable sales growth of Sam's Club in the United States from fiscal year 2006 to 2020 has been highly-fluctuatingly decreasing from 2006 to 2020." 1.07% (4) of the statements we analyzed contained this error.
Repetition. Repetition is when a caption repeats a previously-generated claim, regardless of its correctness. 117 (31.2%) statements contained repetition, making it the most common error we encountered. For example, in Appendix Figure 4a, we see "The average age at widow hood in the Netherlands has increased from 2008 to 2018. The average age at widow hood in the Netherlands has increased from 2008 to 2018." Repetition is a known problem for text generation models with transformer backbones, like our chart captioning models (Fu et al., 2021).
Nonsensical Errors. If an L2 or L3 statement cannot be understood in the context of the chart, or makes a fundamental mistake in interpretation, we label it as a nonsensical error. We encountered 20 nonsensical errors in addition to the 375 statements we analyzed. For example, in Appendix Figure 5b, "The most popular visitors was Harry Potter in 1999 and 2009." misinterprets the chart. It might instead correctly read "The destination with the most visitors after the TV/movie's release was New Zealand for The Lord of the Rings".
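The percentages reported for each error category can be recomputed from the raw counts. The following is a minimal sketch, assuming the denominator is the 181 L2 plus 194 L3 statements (375 in total) and leaving out nonsensical errors, which are counted separately; the names `ERROR_COUNTS` and `error_rates` are illustrative and not part of the VisText code.

```python
# Statement counts reported in the qualitative analysis.
L2_STATEMENTS = 181
L3_STATEMENTS = 194

# Error counts per category, as reported in the text.
# Nonsensical errors are omitted because they are counted separately.
ERROR_COUNTS = {
    "identity": 86,
    "value": 12,
    "direction": 32,
    "stability": 4,
    "repetition": 117,
}


def error_rates(counts, total):
    """Return each category's share of analyzed statements, in percent."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


total_statements = L2_STATEMENTS + L3_STATEMENTS
rates = error_rates(ERROR_COUNTS, total_statements)

if __name__ == "__main__":
    for name, pct in rates.items():
        print(f"{name:>10}: {ERROR_COUNTS[name]:>3} statements ({pct:.2f}%)")
```

Running the sketch reproduces the shares quoted above, which is a quick consistency check on the reported counts.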

Discussion
We present VisText, a chart captioning dataset of 12,441 charts and semantically rich captions. The VisText charts are represented as a rasterized image, data table, and scene graph to provide diverse and complementary data modalities. Using VisText, we fine-tune large language models to generate natural language captions from textual chart representations and integrate image-guided chart captioning to leverage multimodal information. Utilizing the varied semantic content in VisText captions, we develop semantic prefix-tuned models that output semantically customized captions to meet diverse user needs. Evaluations reveal that our models output precise and semantically descriptive captions, performing on par with state-of-the-art chart captioning models (Kantharaj et al., 2022) across machine translation and text generation metrics. Looking ahead, while accessibility remains a key domain that would benefit from automated chart captioning, and deploying automated chart captioning models into the field is an exciting prospect, we believe the most promising approach for future work lies in "mixed-initiative" (i.e., human + AI) chart authoring systems. In particular, as we describe in our Ethics Statement below, chart captioning models are currently prone to make a number of factual inaccuracies which can have severe harmful consequences. On the other hand, by integrating these models into chart authoring systems (e.g., Tableau, Charticulator, Data Illustrator, or Lyra), chart authors can intervene and make any necessary corrections. Indeed, such integration offers exciting opportunities to develop novel interactive methods for verifying generated captions. 
For instance, models like ours could generate an initial caption (or set of captions) based on the chart currently being authored; as the system has access to all three representations of the chart (the backing data table, chart image, and structured scene graph), it might automatically segment the caption into independent "data segments" and interactively link and map them to rows in the table or regions on the chart, akin to Kori (Latif et al., 2021).

Limitations
Computational Constraints. Despite using modern GPUs, with large amounts of memory, we were forced to use the smallest-parameter variants of T5 and ByT5 as we encountered out-of-memory errors with the larger alternatives. More problematically, the quadratic relationship between sequence length and time/space complexity of transformer architectures (Vaswani et al., 2017), especially when using byte-level sequences (Xue et al., 2022), has had a significant impact on our model performance.
In particular, to be computationally tractable, we were forced to truncate our input and output sequences to, at most, 1,024 and 512 characters respectively (1,024 coming from the underlying ByT5 architecture (Xue et al., 2022)).
These character thresholds have likely had an outsized effect on scene-graph models. For instance, due to these character limits, we reduced scene graph sequences to only a minimal set of visual characteristics; VisText also includes the raw, unprocessed scene graphs which offer a richer source of information about the visual features that are important to how people decode charts (e.g., bounding boxes, color) but were unavailable to our models. Moreover, as Figure 2 shows, even with this reduced representation, the mean length of scene graph sequences is 948 characters (cf. 426 characters for data tables) with a wide distribution. Thus, despite scene-graph models achieving comparable performance to data-table models, the former saw a much smaller proportion of complete sequences as compared to the latter. This truncation step additionally negatively impacts charts with long titles or axis names -in such cases, we observed that the L2 or L3 caption would be altogether truncated before generation.
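To illustrate how a fixed character limit affects the two textual representations differently, the sketch below truncates sequences and measures the share that fit without loss. The limits come from the text; the sequence lengths are hypothetical placeholders chosen for illustration, not drawn from VisText, and the function names are our own.

```python
MAX_INPUT_CHARS = 1024   # input limit stated in the text
MAX_OUTPUT_CHARS = 512   # output limit stated in the text


def truncate(text, limit):
    """Keep at most `limit` characters of `text`."""
    return text[:limit]


def fraction_complete(sequences, limit):
    """Share of sequences that fit within `limit` without losing characters."""
    if not sequences:
        return 0.0
    return sum(len(s) <= limit for s in sequences) / len(sequences)


# Hypothetical linearized inputs; lengths are illustrative only.
scene_graphs = ["s" * n for n in (600, 948, 1300, 2000)]
data_tables = ["d" * n for n in (200, 426, 700, 1100)]

if __name__ == "__main__":
    print("scene graphs complete:", fraction_complete(scene_graphs, MAX_INPUT_CHARS))
    print("data tables complete: ", fraction_complete(data_tables, MAX_INPUT_CHARS))
```

Because scene graph sequences are longer on average, a larger share of them loses trailing content under the same limit, which is the effect described above.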

Chart Types and the Visualization Design Space. VisText is scoped to only univariate bar, area, and line charts. We chose to begin with these chart types informed by data visualization research that has focused on studying natural language descriptions of single-series line charts, a basic but commonly occurring chart type that offers a compelling target of study as it most visibly surfaces any potential trends in the data (Kim et al., 2021; Stokes et al., 2022). Future work can now begin to consider more complex chart forms in a step-by-step manner. For instance, work could move from univariate bar, area, and line charts to multivariate versions of these chart types (i.e., stacked bars and areas, grouped bars, and multi-series line charts). From there, work can also consider chart types that surface perceptual and cognitive phenomena in visually distinct ways (e.g., scatterplots, where trends appear as clusters of points; heatmaps, where color saturation often encodes a trend; or maps, where color or other layered elements such as symbols are used to represent data values). Finally, automated methods for captioning visualizations may eschew chart typologies altogether in favor of visualization grammars: by offering a more composable and combinatorial approach to the design space (Wilkinson, 2012), learning over visualization grammars may offer a more robust approach to captioning highly customized or unique visual forms.
For each future work direction, we anticipate scene graph representations to prove more fruitful than the data table. As the complexity of the visualization increases, its relationship to the data table only grows more ambiguous; the scene graph, on the other hand, directly encodes the visual form and thus remains faithful to it. As a result, to support such future work, VisText provides the raw specifications used to produce our charts (via the Vega-Lite visualization grammar (Satyanarayan et al., 2016)) as well as the raw, hierarchical scene graphs prior to our linearization and reduction step.

Ethics Statement
The Consequences of Incorrect Captions. Weidinger et al. (2021) comprehensively survey the risks associated with the large language models (LLMs) that underlie our contribution. Of the six categories of risk they identify, harms stemming from models producing factually incorrect statements are not only most pertinent to our work, but are likely heightened compared to general uses of LLMs given the context we are addressing: automatically captioning charts. In particular, people most often consume charts and visualizations in order to make data-driven decisions (Keim et al., 2008), for instance, about whether to evacuate ahead of a hurricane (Padilla et al., 2018), or health & safety during the pandemic (Shneiderman, 2020). Moreover, recent results have shown that readers not only fixate for longer and are more likely to recall the textual content of and around visualizations (Borkin et al., 2015) but this textual content can strongly influence the takeaway message readers leave with even when it is at odds with the depicted data (Kong et al., 2018, 2019). Finally, these issues are exacerbated by the persuasive and rhetorical force of data and charts (Kennedy et al., 2016; Hullman and Diakopoulos, 2011), which often project a sense of authority and certainty (Correll, 2019). As a result, readers may not think to double check the accuracy of chart captions, and inaccurate statements that models may produce could lead to harmful downstream decisions.
To proceed ethically with this line of research, we believe that advances in data and modeling need to be closely followed by attention devoted to mitigating the risks of incorrect statements. At base, automatically generated captions should be identified as such at the forefront to raise readers' awareness about the potential for incorrect statements. And, interactive visual linking strategies (such as those explored by Kong and Agrawala (2012); Kim et al. (2018)) could be deployed to help readers manually verify the constituent statements of a caption against the chart. These strategies, however, place the burden of harm mitigation on readers. Thus, an alternate approach might never surface automatically generated captions to readers directly but instead use them as part of mixed-initiative systems for jointly authoring visualization and text, such as Kori (Latif et al., 2021). In such systems, automated chart captioning models would help to accelerate the authoring process -combatting the blank slate problem by providing an initial summary of the chart -and chart authors would make any necessary corrections prior to publication.
Besides these human-computer interaction (HCI) approaches for mitigating harm, an equally important direction for future work should leverage interpretability techniques to more deeply study what the models are learning. To what degree are chart captioning models stochastic parrots (Bender et al., 2021), and how much do they understand the information charts depict?
Automated Captioning for Accessibility. Although accessibility is a guiding motivation for the bulk of work in automated captioning (be it image captioning or, as in our case, chart captioning), studies find mixed reactions, at best, about these approaches among people with disabilities (PWDs).
For instance, accessibility educator and researcher Chancey Fleet described Facebook's automatic image descriptions as "famously useless in the Blind community" despite "garner[ing] a ton of glowing reviews from mainstream outlets" (Fleet, 2021; Hanley et al., 2021). This disconnect appears to stem from a more fundamental mismatch between what PWDs describe as their captioning needs, and what the research community, particularly through its automatic, quantitative evaluations, prioritizes (Jandrey et al., 2021). In particular, surveys with PWDs repeatedly surface the contextual nature of captions. Bennett et al. (2021) find that the context of use shapes the degree to which PWDs are comfortable with captions describing people's race, gender, and disabilities; for instance, their preferences changed depending on whether they were in a white, cisgender, nondisabled, and professional company versus their own community. Similarly, Jung et al. (2022) find shifting preferences for the content image descriptions should convey across different photo activities: for example, when viewing or taking photos, participants wished for descriptions that conveyed spatial cues, whereas when searching or reminiscing about photos, participants hoped for descriptions to connect to personal data or differentiating details.
In contrast, quantitative metrics of model performance compare generated captions to a single "ground truth" caption. This framing of success not only makes it difficult to develop contextually varying caption generation but can actively penalize such investigations. For instance, with our work, we explored how prefix-tuning can be used to develop models that are responsive to users' preferences about semantic content. However, as described in Sec. 5.1, existing quantitative metrics of model performance (e.g., BLEU, ROUGE, WMD, and TER) show a drop in model performance despite our qualitative analysis indicating that these captions are indeed high quality.
Finally, our exploration of semantic prefix-tuning represents only a very preliminary step towards addressing the contextual captioning needs of PWDs. In particular, the semantic labels VisText assigns to captions were derived from prior work (Lundgard and Satyanarayan, 2022) that only explored natural language descriptions when consuming presentations of visualizations, one task from a broader palette (Brehmer and Munzner, 2013). Future work might instead extend the VisText dataset, and corresponding models, to consider captions for a broader range of tasks, including consuming visualizations for scientific discovery or enjoyment, or producing, searching, or querying visualizations (Brehmer and Munzner, 2013).

Model Generated L1 Caption
Impact of coronavirus (COVID-19) on monthly retail sales development in the United States 2020, by retail sector is a bar graph. The y-axis measures retail sector while the x-axis measures July to August.

Model Generated L2/L3 Caption
The total retail sector has been impacted on clothing and clothing accessories sectors the most. The lowest impact was in sporting goods.
(a) The scene-graph model's output L1 caption and L2/L3 caption for a VisText bar chart of the "Impact of coronavirus (COVID-19) on monthly retail sales development in the United States 2020, by retail sector". The model correctly identifies the chart's title and axes, and it correctly identifies the most and least impacted sectors.

Model Generated L1 Caption
Here a bar graph is labeled Number of visitors to destinations which featured in select screen products before and after the film/TV series release (in thousands) as of 2011. The y-axis measures Title/location/years while the x-axis measures After release.

Model Generated L1 Caption
Average spending per consumer on selected music activities in the United States as of July 2018 is a bar graph. The x-axis measures Response while the y-axis measures $40 to $99.99.

Model Generated L2/L3 Caption
The most popular music activity is vinyl albums and vinyl singles. The least popular music activity is vinyl albums.
Since we compute Relation Generation using only the L1 chart fields (e.g., chart title, axis scale, etc.), we do not report the results separately for L1 versus L2/L3 captioning. There is no direct Relation Generation analog for L2/L3 captions, since they are human-generated and do not follow a specific template. The Relation Generation for L1 captions is identical to the Relation Generation for L1/L2/L3 captions reported in Table 1.

B.2 Evaluation Details
Quantitative Model Performance Metrics. We evaluate our models using NLP and machine translation metrics, including BLEU (Papineni et al., 2002; Lin and Och, 2004).
• BLEU: BLEU requires several gold standard references. In our evaluation setup, we use the test set caption as a single reference.
• Perplexity: We use a pretrained GPT-2 Medium model to compute Perplexity.
• Relation Generation: The fields we evaluate on are the chart title, axis names, and axis scales (if any).
• Translation Edit Rate (TER): Edits consist of deletions, additions, and substitutions, as present in SacreBLEU.
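As a rough illustration of the Relation Generation bullet above, the sketch below scores a caption by the fraction of chart fields (title, axis names, and axis scales, if any) that it mentions. This is a simplified reading under stated assumptions: the case-insensitive substring match and the `relation_generation` function are our own choices for illustration and may differ from the matching used in the VisText evaluation code.

```python
def relation_generation(caption, fields):
    """Fraction of non-empty chart fields that appear verbatim in the caption.

    Matching is a case-insensitive substring test, which is an assumption
    made for this sketch.
    """
    present = [value for value in fields.values() if value]
    if not present:
        return 0.0
    text = caption.lower()
    hits = sum(value.lower() in text for value in present)
    return hits / len(present)


# Fields and caption taken from the example L1 caption in this appendix.
fields = {
    "title": "Average spending per consumer on selected music activities "
             "in the United States as of July 2018",
    "x_axis": "Response",
    "y_axis": "$40 to $99.99",
    "x_scale": None,  # no scale for this axis; skipped by the metric
}
caption = (
    "Average spending per consumer on selected music activities in the "
    "United States as of July 2018 is a bar graph. The x-axis measures "
    "Response while the y-axis measures $40 to $99.99."
)

if __name__ == "__main__":
    print(relation_generation(caption, fields))
```

A caption that omitted the y-axis range would score two out of three on the same fields.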
Qualitative Caption Evaluation. To produce our qualitative evaluation results (Sec. 5.2), we iteratively evaluated randomly sampled captions until there was no more marginal information about the types of errors to be gained from evaluating more captions. For each L2/L3 caption, we assess the number of independent, mutually exclusive L2 and L3 claims/statements that are being made. In comparison to evaluating at a sentence level, this allows us to take a more nuanced approach that isn't limited by where the model has generated a full stop. This approach allows us to more accurately evaluate factual precision without overly penalizing a single mistake. An example might take the form of "The lowest value is X (claim 1), the highest value is Y (claim 2), and the second highest is Z (claim 3). Overall, it is increasing over time (claim 4)." We observe that the first sentence is a compound sentence that consists of three independent clauses, each with a single factual L2 claim, while the second sentence is a single factual L3 claim. Let us assume that claim 1 was factually incorrect. If we evaluate at a sentence level, then the entire first sentence comprising claim 1, claim 2, and claim 3 would be incorrect. However, by breaking this caption into independent, mutually exclusive claims, we can more precisely calculate the factual precision of our text generation.
(a) Ablation study results using the combined L1L2L3 captions.
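The four-claim example in the qualitative evaluation paragraph can be worked through in code. The sketch below is illustrative only: the `Claim` record and both scoring functions are hypothetical helpers, and the correctness labels follow the assumption in the text that only claim 1 is wrong.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    sentence: int   # index of the sentence that contains the claim
    level: str      # "L2" or "L3"
    correct: bool   # annotator's judgment against the chart


def claim_precision(claims):
    """Share of individual claims that are factually correct."""
    return sum(c.correct for c in claims) / len(claims)


def sentence_precision(claims):
    """Share of sentences in which every claim is correct."""
    by_sentence = {}
    for c in claims:
        by_sentence[c.sentence] = by_sentence.get(c.sentence, True) and c.correct
    return sum(by_sentence.values()) / len(by_sentence)


# "The lowest value is X, the highest value is Y, and the second highest
# is Z. Overall, it is increasing over time." with claim 1 incorrect.
caption_claims = [
    Claim(sentence=0, level="L2", correct=False),
    Claim(sentence=0, level="L2", correct=True),
    Claim(sentence=0, level="L2", correct=True),
    Claim(sentence=1, level="L3", correct=True),
]

if __name__ == "__main__":
    print("claim-level precision:   ", claim_precision(caption_claims))
    print("sentence-level precision:", sentence_precision(caption_claims))
```

Claim-level scoring credits three of four claims, whereas sentence-level scoring credits only one of two sentences, which shows how a single mistake is penalized less heavily.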

C Ablation Studies
To evaluate our modeling and dataset design choices, we run ablation studies measuring the impact of our transformer model backbones and stochastic data generation pipeline. We report the results in Table 3.
Transformer Backbone. To understand the impact of our token-free, byte-to-byte ByT5 model backbone, we explore other large language models. Specifically, we compare our 300M parameter ByT5-small model (Xue et al., 2022) with a 60M parameter T5-small (Raffel et al., 2020) and 140M parameter BART-base model (Lewis et al., 2020). We also apply prefix-tuning to the ByT5 and T5 models. We cannot apply prefix-tuning to BART because BART does not support multi-task learning. Quantitatively, using ByT5 does not appear to significantly improve upon T5. However, we theorize that ByT5's token-free paradigm increases the input sequence length by compressing more input text into fewer input tokens.
L1 Caption Generation. Since we generate L1 captions stochastically, we evaluate whether our initial randomization impacted the model's results. We generate a second set of L1 captions using a different random seed and compare the results. We see the results are nearly identical across all metrics, indicating our dataset captures a diverse set of L1 captions.
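To make the role of the random seed concrete, the sketch below generates an L1 caption by sampling one title template and one axis template. The templates are illustrative only: they are modeled on the example L1 captions shown in this appendix, and they are not the full set of 3 title and 6 axis templates used to build VisText. The `generate_l1` function is hypothetical.

```python
import random

# Illustrative templates modeled on example captions in this appendix;
# NOT the actual VisText template set.
TITLE_TEMPLATES = [
    "Here a {chart_type} is labeled {title}.",
    "{title} is a {chart_type}.",
]
AXIS_TEMPLATES = [
    "The x-axis measures {x} while the y-axis measures {y}.",
    "The y-axis measures {y} while the x-axis measures {x}.",
]


def generate_l1(chart, seed):
    """Build an L1 caption from randomly chosen templates.

    A dedicated Random instance makes the choice reproducible per seed.
    """
    rng = random.Random(seed)
    title = rng.choice(TITLE_TEMPLATES).format(**chart)
    axes = rng.choice(AXIS_TEMPLATES).format(**chart)
    return f"{title} {axes}"


chart = {
    "chart_type": "bar graph",
    "title": "Average spending per consumer on selected music activities "
             "in the United States as of July 2018",
    "x": "Response",
    "y": "$40 to $99.99",
}

if __name__ == "__main__":
    for seed in (0, 1):
        print(seed, generate_l1(chart, seed))
```

Re-running with the same seed yields the same caption, while a different seed may select different templates; this is the kind of variation the ablation above compares.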

D Implementation Details
Code to train and evaluate our text-based and image-guided models is available at https://github.com/mitvis/vistext. Table 4 summarizes our model training parameters.

D.1 Text-Based Chart Captioning
To train our text-based chart captioning models, we use the Hugging Face implementation of ByT5 (Wolf et al., 2019). Due to hardware limitations, we use the ByT5-small model, which has 300M parameters. We fine-tune each model for 50 epochs, using Adam optimization with a learning rate of 5e−05. To fit the input features into GPU memory, we truncate the input text (i.e., scene graph or data table) to 1024 tokens and the output caption to 512 tokens. We select the best model epoch based on the validation loss. See Table 4 or the VisText GitHub repository for each model's full training details and hyperparameters. We train each model three times with and without prefix-tuning and report the mean and standard deviation in Table 1. We train each model on four NVidia V100 GPUs with 32GB of memory connected by an NVLink2 network. With prefix-tuning, training, evaluation, and inference took approximately 39 hours for the scene-graph model and 11 hours for the data-table models. Without prefix-tuning, training, evaluation, and inference took approximately 78 hours for the scene-graph model and 22 hours for the data-table models. We estimate that we trained each model between 30 and 45 times to achieve our final results.
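The checkpoint selection and reporting steps described above can be sketched in a few lines. All numbers below are hypothetical placeholders rather than results from our experiments, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator is reported.

```python
from statistics import mean, stdev


def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def summarize(scores):
    """Mean and sample standard deviation across repeated training runs."""
    return mean(scores), stdev(scores)


# Hypothetical per-epoch validation losses for one run.
val_losses = [0.92, 0.71, 0.64, 0.66, 0.70]

# Hypothetical metric values from three independent runs.
metric_runs = [0.30, 0.32, 0.34]

if __name__ == "__main__":
    print("best epoch:", best_epoch(val_losses))
    avg, sd = summarize(metric_runs)
    print(f"metric: {avg:.2f} +/- {sd:.2f}")
```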

D.2 Image-Guided Chart Captioning
Our image-guided chart captioning models extend the VLT5 model (Cho et al., 2021), which is a multimodal extension of T5-base. We extract visual features from VisText's chart images using Bottom-Up Feature Extraction (Anderson et al., 2018) and 36 bounding boxes per image. After feature extraction, we fine-tune VLT5 on the VisText dataset for 50 epochs following the default VLT5 training protocol (Cho et al., 2021). To fit the input features into GPU memory, we truncate the input text (i.e., scene graph or data table) to 1024 tokens and the output caption to 512 tokens. After 50 epochs, we select the epoch with the lowest validation loss as the best model. See Table 4 or the VisText GitHub repository for each model's full training details and hyperparameters.
We train each model three times with and without prefix-tuning and report the mean and standard deviation in Table 1. We train each model on four NVidia V100 GPUs with 1TB of memory. The image models take approximately 2 minutes per training epoch without prefix-tuning and approximately 3

D.3 Ablation Models
We train our ablation models using the same parameters as our default models, only varying the parameter of interest. We train them on 16 virtual CPU cores on Xeon E5 hypervisors with 128GB of memory and PCI pass-through access to eight NVidia Titan XP GPUs with 12GB of memory.

D.4 Notable Package Versions
Package versions are listed in Table 5.

E Additional VisText Dataset Details

E.1 Licensing
Our use of the raw Statista data from Kantharaj et al. (2022) is consistent with its intended use case. The data was licensed under the GNU General Public License v3.0. We release our data and code under GNU General Public License v3.0.

E.2 L1 Caption Generation Process
The Level 1 captions are generated from a random process that chooses from 3 title templates and 6 axis templates. The title templates we use are:

Our study was determined to be exempt by MIT's institutional review board (IRB). Participants had to explicitly provide their consent in order to proceed with the study; if participants did not consent, they were redirected back to the Prolific platform. The consent statement (Fig. 8) reminded participants of their rights (including that their participation is voluntary and consent could be revoked at any time), and encouraged participants to contact either the study PI or the IRB directly should they have any concerns. We constrained our participant pool (and eligibility requirements) to people living within the United States or United Kingdom who self-reported as being sighted with no vision or color impairments. We did not collect any additional demographic data from participants as we did not determine this to bias or otherwise affect the content we hoped to collect.
Each task (an example of which is shown in Fig. 10) included an attention check where participants were asked to correctly identify the chart type shown. If participants failed more than two attention checks, their submission was flagged for manual review -in practice, the bulk of participants who failed attention checks nevertheless produced valid captions and, thus, were paid fully. The task asked participants to complete a free response question to describe as completely as they could the trends and patterns observed, emphasizing that their response would be evaluated for correctness and completeness. Despite best practices suggesting a more structured, querying approach (called QID) can yield higher quality captions (Morash et al., 2015), we opted for our free-response approach as the benefits of QID (namely, in expressing the chart type, title, and axes units) would already be captured by our synthetically generated L1 captions. Moreover, in contrast to the templatized output produced by QID, we hoped that our free-response responses would yield more "natural" articulations of perceptual and cognitive trends, following the Lundgard and Satyanarayan (2022) framework.

Describing Data Visualizations
You are invited to participate in a research study about describing data visualizations to improve their accessibility to people who are blind or have low-vision. You have been asked to participate because we need to determine what language sighted readers use to describe charts, and the sorts of trends and patterns you identify when reading a chart.
We will ask you to read data visualizations (such as a bar chart or line chart), and to also read a textual description of each visualization. Then, we will ask you to answer questions about each visualization, and to write your responses in English.
Full completion of this study consists of the following: 1. Reading through an introductory example task.
2. Completing up to 5 visualization description tasks, with 1 open-ended question per task (5 text input questions total).
The estimated time to complete this study is 14 minutes.

Consent Statement
Please review the following information about this study.
If you agree to participate in this study, we will ask you to complete a sequence of visualization description tasks.
We expect this study to take 14 minutes, and you will receive $3.25 as compensation. We don't anticipate any risks from participating in this study to be greater than normal activity.
There will be no direct benefits to you other than compensation. No personally identifying information will be revealed to anyone other than the researchers conducting this study. The records of this study will be kept private. In any sort of report we make public, we will not include any information that will make it possible to identify you. Research records will be kept on encrypted serves. Only the researchers will have access to the records. Participating in this study is completely voluntary. If you decline to participate, it will not affect your current or future relationship with the researchers. If you agree to participate, you are free to withdraw at any time. We will delete any data related to your participation upon your request.

Eligibility Statement
You are eligible for this study if and only if: 1. You are sighted and do not have color blindness or other color vision impairment.
2. You are comfortable reading and writing in English.
I have read the above information and AGREE to participate in this study.
I have read the above information and DO NOT AGREE to participate in this study. Full completion of this study consists of the following: 1. Reading through an introductory example task.
2. Completing up to 5 visualization description tasks, with 1 open-ended question per task (5 text input questions total).
The estimated time to complete this study is 14 minutes.

Consent Statement
Please review the following information about this study.
If you agree to participate in this study, we will ask you to complete a sequence of visualization description tasks. We expect this study to take 14 minutes, and you will receive $3.25 as compensation.
We don't anticipate any risks from participating in this study to be greater than normal activity.
There will be no direct benefits to you other than compensation. No personally identifying information will be revealed to anyone other than the researchers conducting this study. The records of this study will be kept private. In any sort of report we make public, we will not include any information that will make it possible to identify you. Research records will be kept on encrypted serves. Only the researchers will have access to the records. Participating in this study is completely voluntary. If you decline to participate, it will not affect your current or future relationship with the researchers. If you agree to participate, you are free to withdraw at any time. We will delete any data related to your participation upon your request. Please enter your Prolific ID here:

Example Task
Thank you for agreeing to participate in this study. To give you an idea of the tasks you will be asked to complete, please read through the following example task.
I have read the above information and AGREE to participate in this study.
I have read the above information and DO NOT AGREE to participate in this study.
I have read the above information and attest that I AM eligible to participate in this study.
I have read the above information and attest that I AM NOT eligible to participate in this study.
${e://Field/PROLIFIC_PID} You are invited to participate in a research study about describing data visualizations to improve their accessibility to people who are blind or have low-vision. You have been asked to participate because we need to determine what language sighted readers use to describe charts, and the sorts of trends and patterns you identify when reading a chart.
We will ask you to read data visualizations (such as a bar chart or line chart), and to also read a textual description of each visualization. Then, we will ask you to answer questions about each visualization, and to write your responses in English.
Full completion of this study consists of the following: 1. Reading through an introductory example task.
2. Completing up to 5 visualization description tasks, with 1 open-ended question per task (5 text input questions total).
The estimated time to complete this study is 14 minutes.

Consent Statement
Please review the following information about this study.
If you agree to participate in this study, we will ask you to complete a sequence of visualization description tasks.
We expect this study to take 14 minutes, and you will receive $3.25 as compensation. We don't anticipate any risks from participating in this study to be greater than normal activity. There will be no direct benefits to you other than compensation.
No personally identifying information will be revealed to anyone other than the researchers conducting this study.
The records of this study will be kept private. In any sort of report we make public, we will not include any information that will make it possible to identify you. Research records will be kept on encrypted serves. Only the researchers will have access to the records. Participating in this study is completely voluntary. If you decline to participate, it will not affect your current or future relationship with the researchers. If you agree to participate, you are free to withdraw at any time. We will delete any data related to your participation upon your request.
You may ask any questions about the research at any time. If you have questions about the research after you leave today you should contact . If you are not satisfied with response of research team, have more questions, or want to talk with someone about your rights as a research participant, you should contact the .

Eligibility Statement
You are eligible for this study if and only if:
Please enter your Prolific ID here:
I have read the above information and AGREE to participate in this study.
I have read the above information and DO NOT AGREE to participate in this study.
I have read the above information and attest that I AM eligible to participate in this study.
I have read the above information and attest that I AM NOT eligible to participate in this study.

Example Task
Thank you for agreeing to participate in this study. To give you an idea of the tasks you will be asked to complete, please read through the following example task.

Example Task
In this task, we first present a visualization, followed by a textual description of that visualization. Then, we ask you a question, and provide an example response.
This question is the same as the ones you will be asked in the upcoming tasks. Note that the responses are open-ended text input. We ask that you try your best and respond in complete sentences.
Please confirm that you have read and understood each example question by clicking the button below it.

Example Visualization Description
A scatterplot entitled "Cars by Country Origin" that plots cars' Horsepower versus Miles_per_Gallon by Origin. Car Origin includes Europe, Japan, and USA, encoded by Color and Shape: Europe (blue, circle), Japan (orange, square), and USA (red, triangle).
Horsepower is plotted on the horizontal x-axis from 0 to 250 with an increment of 50.
Miles_per_Gallon is plotted on the vertical y-axis from 0 to 50 with an increment of 10.
This concludes the Example Task. You will now be asked to complete up to 5 similar visualization description tasks.

Task
Directions: Please read the data visualization and the corresponding textual description.
Then, please provide a response to the text-input question.

Example Question
Besides what was already said in the above description, what other conclusions can you draw from this visualization? What trends or patterns can you observe?
Please only articulate trends or patterns using information contained in the visualization.
Please do not introduce background knowledge or assumptions about the data apart from what is represented in the visualization. Please state as many observations as you are able. Please state each observation using one complete sentence. Your response will be evaluated for its correctness and completeness.

Example Responses
Japan produces the car with the greatest Miles_per_Gallon. USA produces the car with the fewest Miles_per_Gallon. Most cars have around 40 to 120 Horsepower. Most of the cars can drive around 20 to 40 Miles_per_Gallon.
These sentences are good because they accurately reference information contained in the chart.
Overall, as horsepower increases, miles per gallon decreases. USA cars have the highest horsepower. European and Japanese cars seem to have better miles per gallon.
These sentences are also good because they describe a trend in the data presented by the chart using only the information provided by the chart.

Incorrect Responses
Certain sentences should not be written. For example:

"The cars have between 0 Miles_per_Gallon and 50 Miles_per_Gallon."
While this is an accurate description of the chart, it should not be written because a similar description was already given in the above Example Description.
"European cars are shown as blue circles, Japanese cars are orange squares, and American cars are red triangles."
This is a bad response because it focuses on the visual appearance of the chart rather than describing trends or patterns.
"Compared to Europe and Japan, the USA does a poor job of manufacturing environmentally efficient cars. Japanese cars have the highest miles per gallon, and are therefore most environmentally friendly."
This is a bad response because it introduces background knowledge about environmental efficiency and friendliness that is not contained within the chart itself.

Task
Directions: Please read the data visualization and the corresponding textual description. Then, please provide a response to the text-input question.

Visualization
What type of visualization is shown above?

Visualization Description
Number of Facebook fans/twitter followers of the Green Bay Packers (NFL) from 2012 to 2020 (in millions) is an area diagram. A linear scale from 0.00 to 2.00 can be found on the y-axis, marked Twitter followers. There is a linear scale from 2014 to 2020 on the x-axis, marked Month.

Question
Besides what was already said in the above description, what other conclusions can you draw from this visualization? What trends or patterns can you observe?
Please only articulate trends or patterns using information contained in the visualization. Please do not introduce background knowledge or assumptions about the data apart from what is represented in the visualization. Please state as many observations as you are able. Please state each observation using one complete sentence. Your response will be evaluated for its correctness and completeness.

Scatter Plot
Line/Area Chart
Bar Chart

Figure 10: An example task from a specific run of the crowdsourcing study.