UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning

Charts are very popular for analyzing data, visualizing key insights, and answering complex reasoning questions about data. To facilitate chart-based data analysis using natural language, several downstream tasks have been introduced recently, such as chart question answering and chart summarization. However, most methods for these tasks rely on pretraining on language or vision-language tasks that do not attempt to explicitly model the structure of charts (e.g., how data is visually encoded and how chart elements are related to each other). To address this, we first build a large corpus of charts covering a wide variety of topics and visual styles. We then present UniChart, a pretrained model for chart comprehension and reasoning. UniChart encodes the relevant text, data, and visual elements of charts and then uses a chart-grounded text decoder to generate the expected output in natural language. We propose several chart-specific pretraining tasks that include: (i) low-level tasks to extract the visual elements (e.g., bars, lines) and data from charts, and (ii) high-level tasks to acquire chart understanding and reasoning skills. We find that pretraining the model on a large corpus with chart-specific low- and high-level tasks, followed by finetuning on downstream tasks, results in state-of-the-art performance on three downstream tasks.


Introduction
Information visualizations such as bar charts and line charts are commonly used for analyzing data, inferring key insights, and making informed decisions (Hoque et al., 2022). However, understanding important patterns and trends from charts and answering complex questions about them can be cognitively taxing. Thus, to facilitate users in analyzing charts, several downstream NLP tasks over charts have been proposed recently, including chart question answering (Masry et al., 2022; Kantharaj et al., 2022; Lee et al., 2022), natural language generation for visualizations (Obeid and Hoque, 2020; Shankar et al., 2022), and automatic data story generation (Shi et al., 2020).
A dominant strategy to tackle these downstream tasks is to utilize pretrained models (Su et al., 2020; Li et al., 2020b; Kim et al., 2021; Cho et al., 2021) trained on language and vision tasks (Du et al., 2022). However, although effective, such models may not be optimal for chart-specific tasks because they are trained on large text corpora and/or image-text pairs without any specific focus on chart comprehension. In reality, charts differ from natural images in that they visually communicate data using graphical marks (e.g., bars, lines) and text (e.g., titles, labels, legends). Readers can discover important patterns, trends, and outliers from such visual representations (Munzner, 2014). Existing pretrained models do not consider such unique structures and communicative goals of charts. For instance, Pix2Struct (Lee et al., 2022) is a pretrained image-to-text model designed for situated language understanding. Its pretraining objective focuses on screenshot parsing based on the HTML code of webpages, with a primary emphasis on layout understanding rather than reasoning over the visual elements. MatCha (Liu et al., 2022b) extends Pix2Struct by incorporating math reasoning and chart data extraction tasks, but it still lacks training objectives for text generation from charts, and it was trained on a limited number of charts.
In this work, we present UniChart, a pretrained model designed specifically for chart comprehension and reasoning. UniChart is pretrained on a large corpus of charts and aims to serve as a Universal model for various chart-related downstream tasks (Fig. 1). Inspired by the model architecture from Kim et al. (2022), UniChart consists of two modules: (1) a chart encoder, which takes the chart image as input, and (2) a text decoder, trained to decode the expected output based on the encoded image and the text input fed to the decoder as a task prompt. We performed pretraining on a diverse set of 611K charts that we collected from multiple real-world sources. Our pretraining objectives include both low-level tasks focused on extracting visual elements and data from chart images, as well as high-level tasks intended to align more closely with downstream applications. One key challenge for pretraining was that most charts in the corpus do not come with informative summaries, which are critical for various downstream tasks. To address this challenge, we used knowledge distillation techniques to leverage large language models (LLMs) for opportunistically collecting chart summaries, which were then used during pretraining.

We conducted extensive experiments and analysis on various chart-specific downstream tasks to evaluate the effectiveness of our approach. Specifically, we evaluated UniChart on two chart question answering datasets, ChartQA (Masry et al., 2022) and OpenCQA (Kantharaj et al., 2022), and found that it outperformed the state-of-the-art models in both cases. For chart summarization, UniChart achieves superior performance in both human and automatic evaluation measures, such as BLEU (Post, 2018) and ratings from ChatGPT (OpenAI, 2022). Moreover, UniChart achieved state-of-the-art results on the Chart-to-Table downstream task. Finally, our model showed improved time and memory efficiency compared to the previous state-of-the-art model, MatCha, being more than 11 times faster with 28% fewer parameters.
Our primary contributions are: (i) a pretrained model for chart comprehension with unique low-level and high-level pretraining objectives specific to charts; (ii) a large-scale chart corpus for pretraining, covering a diverse range of visual styles and topics; and (iii) extensive automatic and human evaluations that demonstrate the state-of-the-art performance of UniChart across various chart-specific benchmark tasks while optimizing time and memory efficiency. We have made our code and chart corpus publicly available at https://github.com/vis-nlp/UniChart.
Our work is also related to multimodal document understanding tasks that involve analyzing the textual content, layout, and visual elements of documents (Xu et al., 2020b,a; Wang et al., 2022; Huang et al., 2022; Kim et al., 2022; Tang et al., 2022). These tasks can be addressed using encoder-only or encoder-decoder architectures. Encoder-only models rely on OCR engines to extract text from document images and use BERT-like encoders augmented with specialized embeddings to encode layout and visual features (Xu et al., 2020b,a; Wang et al., 2022; Huang et al., 2022). In contrast, encoder-decoder architectures combine transformer-based encoders with autoregressive text decoders for text generation tasks related to documents (Tang et al., 2022; Kim et al., 2022; Lee et al., 2022). While Tang et al. (2022) incorporate an OCR tool to supplement the vision encoder, Kim et al. (2022) and Lee et al. (2022) operate in an end-to-end manner without external OCR engines. In line with the latter approach, our model adopts an end-to-end encoder-decoder architecture (Kim et al., 2022).
In general, the above work focuses on training on large image-text pairs or text corpora, with little focus on chart understanding. One exception is MatCha (Liu et al., 2022b), a pretrained chart model based on Pix2Struct (Lee et al., 2022), which achieved SoTA on chart question answering and summarization tasks. However, MatCha's pretraining tasks mainly target data table generation and do not cover text generation from charts. The model is also pretrained on reasoning tasks using textual datasets, which might limit its visual reasoning ability. Our model is trained on a larger corpus with chart-specific pretraining objectives, including visual reasoning and text generation, making it more versatile for various chart-related tasks.

Chart-related Downstream Tasks
There has been growing interest in solving various chart-related tasks. Chart question answering (ChartQA) tackles questions about charts, with benchmarks like PlotQA (Methani et al., 2020) and ChartQA (Masry et al., 2022) targeting factoid questions involving visual and arithmetic reasoning. The open-ended question answering (OpenCQA) task requires an explanatory answer produced by reasoning over the chart content (Kantharaj et al., 2022). Finally, Chart-to-Text generates natural language summaries from input charts (Shankar et al., 2022), while Chart-to-Table generates their underlying data tables (Choi et al., 2019). We evaluate our model on these four chart-related tasks, as they involve the interaction between language and vision and have publicly available datasets. There are a few other tasks, such as infographics understanding (Mathew et al., 2022) and question answering over science diagrams (Kembhavi et al., 2016); however, in this work, we only focus on chart-related tasks.

Chart Pretraining Corpus
To build a large and diverse corpus with various styles, topics, and storage formats, we crawled charts from various online sources. Additionally, we utilized publicly available chart datasets suitable for pretraining. The collected charts can be categorized into two types: charts with underlying data tables and charts without data tables.

Charts with Data Tables
Charts with an underlying data table are collected in three ways: (i) utilizing existing datasets, (ii) extracting data from SVG charts, and (iii) data augmentation.
• Utilize Existing Datasets Our goal was to train the model on real-world data; thus, we did not consider datasets generated from synthetic data (Kafle et al., 2018; Kahou et al., 2018). In particular, we used the following five chart datasets for which the underlying data tables were available: (i) Statista (statista.com) (Shankar et al., 2022), (ii) Our World In Data or OWID (ourworldindata.org) (Masry et al., 2022), (iii) Organisation for Economic Co-operation and Development or OECD (oecd.org) (Masry et al., 2022), (iv) PlotQA (Methani et al., 2020), and (v) a subset of the ChartInfo (ChartInfo, 2022) dataset that provides bounding box annotations for data encoding marks (e.g., bars in a bar chart).
• Extract SVG Charts: We extracted charts in SVG format from the Chartblocks and Plotly datasets of the Beagle corpus (Battle et al., 2018). These charts do not come with data tables, but the data can be extracted accurately from the SVG elements. The steps for preparing these charts are: (1) identify axis labels and legends using specific class names of HTML attributes, (2) extract bounding boxes of chart elements (e.g., bars, lines) using SVG attribute properties (e.g., the size and location of <rect>), and (3) construct the underlying data table by iterating through each of the <g> elements to find the data values of each data attribute. When data labels are absent, we utilize the scale information derived from the axis labels and tick marks of the chart, together with the bounding box information of the data encoding marks, to recover the data values.
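To make the recovery step concrete, the following is a minimal sketch (not our actual extraction code) that maps bar heights in an SVG to data values; it assumes bars are <rect> elements anchored at a zero baseline and that the pixel positions and data values of two y-axis ticks are known (all names are illustrative):

```python
# Hypothetical value-recovery sketch; tag names and the linear-scale
# assumption are illustrative, not the exact pipeline described above.
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def recover_bar_values(svg_path, tick_px, tick_vals):
    """tick_px / tick_vals give the pixel y-positions and data values of
    two y-axis ticks, from which the linear scale is derived."""
    (y0, y1), (v0, v1) = tick_px, tick_vals
    units_per_px = abs((v1 - v0) / (y1 - y0))
    tree = ET.parse(svg_path)
    # Each bar is a <rect>; its pixel height times the scale gives its
    # value, assuming the bars start at the zero baseline.
    return [float(r.get("height")) * units_per_px
            for r in tree.iter(SVG_NS + "rect")
            if r.get("height")]
```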
• Data Augmentation We further augmented the corpus by creating charts from publicly available data tables. We used the Web Data Commons (WDC) (WDC, 2022), which used Common Crawl to collect a large amount of structured data.
The charts are created in the following steps: (i) Data pre-processing: Since many tables in WDC contain more than three columns, we decomposed them so that the resulting tables are suitable for creating the desired chart types (e.g., bar, line, and pie charts). In particular, we automatically analyze the data type of each column (e.g., numeric vs. categorical) and then randomly choose one column with numeric data values and one or two columns with categorical data. We also limit the maximum number of rows of the table to 8 so that the corresponding chart can fit within reasonable screen space. A minimal code sketch of steps (i) and (ii) appears after step (iii) below.
(ii) Chart generation: To generate visually diverse charts, we used the D3 (Bostock et al., 2011) library, which provides great flexibility in creating diverse visualization styles. We also employed Vega-Lite (Satyanarayan et al., 2016), which creates charts based on a declarative JSON syntax. We used simple heuristics for determining chart types from the data table (Mackinlay et al., 2007). We created four types of charts: (1) vertical simple bar charts with one numeric data column, (2) vertical grouped bar charts, (3) pie charts, and (4) line charts (both single-series and multi-series).
(iii) Visual diversification: To create visually diverse charts resembling real-world variations, we manipulated the following visual style properties: (1) Colors and shapes: Color schemes from ColorBrewer and Tableau were chosen for categorical data attributes. We also varied shape properties such as bar thickness, line types (e.g., continuous vs. dotted), and legend shape types (e.g., rect, circle). (2) Position and distance: We also varied bar positions and distances with respect to axis labels.
(3) Guides: Charts may contain additional guides such as grids, so we generate charts with and without grids to diversify styles.
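The sketch below illustrates steps (i) and (ii) under simplifying assumptions (a pandas DataFrame as input, a Vega-Lite spec as output); the column-selection logic and the pie-chart threshold are illustrative stand-ins for our exact heuristics:

```python
# Hedged sketch of the table-to-chart augmentation pipeline.
import pandas as pd

MAX_ROWS = 8  # keep the rendered chart within reasonable screen space

def make_chart_spec(df: pd.DataFrame):
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    if not numeric or not categorical:
        return None  # table unsuitable for our chart types
    sub = df[[categorical[0], numeric[0]]].head(MAX_ROWS)
    # Toy heuristic in the spirit of Mackinlay et al. (2007):
    # few categories -> pie chart, otherwise a vertical bar chart.
    if sub[categorical[0]].nunique() <= 4:
        mark = "arc"  # pie chart
        enc = {"theta": {"field": numeric[0], "type": "quantitative"},
               "color": {"field": categorical[0], "type": "nominal"}}
    else:
        mark = "bar"
        enc = {"x": {"field": categorical[0], "type": "nominal"},
               "y": {"field": numeric[0], "type": "quantitative"}}
    return {"data": {"values": sub.to_dict("records")},
            "mark": mark, "encoding": enc}  # declarative Vega-Lite spec
```

Visual style properties such as colors, bar thickness, and grids can then be varied on the resulting spec before rendering.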
Fig. 2 depicts a visually diverse set of charts created using this augmentation process. In total, we created 189,839 charts (Table 4).

Charts without Data Tables
Many online charts are available only as images, without corresponding data tables. However, they can still be valuable for large-scale pretraining, as we can extract chart elements and rich textual contents (e.g., titles, surrounding texts, captions) using object detection and optical character recognition (OCR) techniques. We collected image chart datasets such as LineCap (Mahinpei et al., 2022) and Neural Caption Generation (Spreafico and Carenini, 2020), since they provide high-quality summaries. We also used the Pew dataset from (Shankar et al., 2022) and further augmented it by crawling an additional 1K charts. Finally, we used the ExcelChart400K dataset (Luo et al., 2021), which only provides bounding boxes without underlying data tables. We also considered other existing image chart datasets such as Vis30K (Chen et al., 2021) and VisImage (Deng et al., 2020), but they are not suitable, as they usually have poor resolution and lack meaningful textual content (e.g., titles).

Augmentation by Knowledge Distillation for Chart-to-text Generation Tasks
Chart-related downstream tasks such as chart summarization (Shankar et al., 2022) and open-ended question answering (Kantharaj et al., 2022) require generating informative and relevant texts. However, for most charts in the pretraining corpus, there are either no associated summaries, or the summaries that were collected opportunistically, such as those in the Statista dataset (Shankar et al., 2022), lack quality (e.g., too short and not very informative). Training on such substandard "groundtruth" summaries can negatively affect overall model performance, as shown in text summarization (Kryscinski et al., 2019; Clark et al., 2021). Indeed, Goyal et al. (2022) and Liu et al. (2023b) have recently shown that human raters prefer summaries generated by LLMs, especially instruction-tuned ones such as InstructGPT (Ouyang et al., 2022), over the reference summaries in various text summarization datasets. Consequently, instruction-tuned LLMs have been successfully used as annotators in several recent studies (Ding et al., 2023; Qin et al., 2023).
Inspired by these findings, we leveraged InstructGPT to generate coherent and relevant text. Specifically, we prompted text-davinci-003 with the underlying data table as input and one exemplar (i.e., 1-shot in-context learning). Since generating summaries for thousands of charts by calling the OpenAI API is quite costly, we devised a knowledge distillation approach. We first used text-davinci-003 to create a small dataset of 3,700 summaries for different chart types. Next, we finetuned Flan-T5 XL (Chung et al., 2022) on this dataset. Finally, we utilized the finetuned Flan-T5 model to generate summaries for charts that do not have an associated summary. More details about this approach can be found in Appendix A.2.
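The following is a hedged sketch of the distillation step using Hugging Face Transformers; the prompt format and the field names (ex["table"], ex["summary"]) are assumptions rather than our exact setup:

```python
# Distillation sketch: finetune Flan-T5 XL on (data table -> summary)
# pairs produced by text-davinci-003, then caption uncaptioned charts.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

def to_features(ex):
    # ex["table"]: linearized chart data table (assumed field name)
    # ex["summary"]: 1-shot summary obtained from text-davinci-003
    x = tok("Summarize the chart data: " + ex["table"], truncation=True)
    x["labels"] = tok(ex["summary"], truncation=True)["input_ids"]
    return x  # feed such examples to a standard seq2seq finetuning loop

def summarize(table: str) -> str:
    # After finetuning, the distilled model captions charts that lack one.
    ids = tok("Summarize the chart data: " + table, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=256)
    return tok.decode(out[0], skip_special_tokens=True)
```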

Datasets Analysis
Our chart pretraining corpus has over 611K charts covering a diverse range of bar charts, line charts, and pie charts (Table 4). Data tables of Simple charts have two columns (simple bar charts or single-series line charts), whereas Complex charts involve at least three columns (e.g., stacked or grouped bar charts, line charts with multiple lines). The first two chart groups in Table 4 come with an underlying data table and cover over 80% of the corpus. The bottom group contains five datasets that only provide charts in image format without a data table and covers about 20% of the corpus. Bar charts make up the majority (58.51%), followed by line charts (32.94%) and pie charts (9.39%). About 60% of the charts have multiple columns in their data tables, while 40% have only two columns. The corpus also covers a diverse range of topics including technology, economy, politics, health, and society. To ensure a fair evaluation, we excluded charts found in the validation and test sets of the downstream tasks from our pretraining corpus. Details about the linguistics of the corpus textual elements can be found in Appendix A.3.

Method
We propose UniChart, a unified pretrained model for chart comprehension and reasoning.This section first introduces the UniChart architecture followed by its pretraining objectives.

Model Architecture
UniChart consists of two main modules: a chart image encoder and a text decoder as shown in Fig. 1.
• Chart Image Encoder In order to effectively encode a chart image, an encoder needs to identify and interpret three different types of chart components: (1) textual elements (axis labels and legends), (2) visual elements (e.g., bars, lines), and (3) the layout that arranges textual and visual elements within a chart. Since this is similar to document image understanding (e.g., for receipts), our chart image encoder builds upon the encoder of one of the recent state-of-the-art document image understanding models, Donut (Kim et al., 2022). Donut offers an OCR-free architecture. The model is pretrained using an OCR pseudo-task, where it sequentially generates the text encoded in a document image, following the order from the top-left corner to the bottom-right corner of the image. As a result, we did not have to run an external OCR module such as CRAFT (Baek et al., 2019) or Parseq (Bautista and Atienza, 2022), which improved time and memory efficiency throughout our training pipeline. Donut employs the Swin Transformer (Liu et al., 2021) architecture as the image encoder. To encode the chart image features, images are split into non-overlapping patches, which are then processed using shifted window-based multi-headed self-attention and MLP layers to produce the image embeddings.
• Text Decoder Similar to Donut (Kim et al., 2022), we use the BART (Lewis et al., 2019) decoder for generating the output. The textual (task-specific) prompts are fed to the decoder, and the decoder generates the output by conditioning on the prompted context (see Fig. 1).
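Since UniChart follows Donut's encoder-decoder interface, inference can be sketched as below; the Hugging Face checkpoint id and the exact prompt strings are assumptions, not guaranteed identifiers:

```python
# Hedged inference sketch for a Donut-style chart model.
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

CKPT = "ahmed-masry/unichart-base-960"  # assumed checkpoint id
processor = DonutProcessor.from_pretrained(CKPT)
model = VisionEncoderDecoderModel.from_pretrained(CKPT)

image = Image.open("chart.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt selects the objective (e.g., data table extraction,
# summarization); the prompt token shown here is an assumption.
prompt_ids = processor.tokenizer(
    "<extract_data_table> <s_answer>",
    add_special_tokens=False, return_tensors="pt").input_ids

out = model.generate(pixel_values, decoder_input_ids=prompt_ids,
                     max_new_tokens=512)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```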

Pretraining Objectives
Our pretraining objectives include low-level tasks that are more focused on retrieving the underlying data from the chart images and high-level tasks that align closely with the downstream tasks.
• Data Table Generation Recovering the data underlying a chart is a fundamental low-level skill, so we train the model to generate the full data table given a chart image. However, many charts are stored as bitmap images without access to the underlying data, and it is important to learn how to recover data values when the chart data is not available. Therefore, we also introduce the data value estimation task, in which the model is asked to generate the scale of the graphical marks (e.g., bars, line points) as a percentage of the chart plot area.
We obtain these scales by dividing the heights of the bars or line points (from their bounding boxes) by the height of the chart plot area and rounding the result to two decimal places. In the final stage, we use charts for which both data tables and object bounding boxes are available, as well as charts for which at least the bounding box annotations are available, e.g., ExcelCharts from (Luo et al., 2021).
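A minimal sketch of this target computation, assuming bounding boxes are (x, y, width, height) tuples in pixels:

```python
def mark_scales(mark_boxes, plot_area_box):
    """Express each mark's height as a fraction of the plot-area height,
    rounded to two decimal places (the pretraining target)."""
    plot_height = plot_area_box[3]
    return [round(box[3] / plot_height, 2) for box in mark_boxes]

# Bars of 120px and 60px in a 200px-tall plot area -> [0.6, 0.3]
print(mark_scales([(10, 80, 30, 120), (50, 140, 30, 60)], (0, 0, 400, 200)))
```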
• Numerical & Visual Reasoning Many downstream applications over charts involve numerical and visual reasoning with chart elements, such as chart QA and summarization. For example, the model may need to apply a series of mathematical and logical operations, such as addition, subtraction, and comparison, to answer a question.
To inject such reasoning skills into the model, we design template-based numerical reasoning tasks where the model is trained to perform the most common mathematical operations over the chart data values. We manually analyzed existing task datasets (e.g., ChartQA) to find the most common operations (e.g., sum, average, difference) and constructed 90 templates that we utilize to generate synthetic question-answer pairs. All the templates are provided in Appendix A.8.
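The sketch below illustrates the template-filling idea; the two templates shown are invented examples standing in for our 90 templates:

```python
# Illustrative template-based numerical-reasoning QA generation.
import random

TEMPLATES = [
    ("What is the sum of {a} and {b}?", lambda va, vb: va + vb),
    ("What is the difference between {a} and {b}?",
     lambda va, vb: abs(va - vb)),
]

def make_qa(data):  # data: dict mapping category label -> numeric value
    (a, va), (b, vb) = random.sample(list(data.items()), 2)
    question, op = random.choice(TEMPLATES)
    return question.format(a=a, b=b), op(va, vb)

q, answer = make_qa({"below poverty": 74, "100-199%": 61, "200%+": 44})
print(q, "->", answer)
```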
• Open-ended Question Answering It is very common for users to ask open-ended questions about charts (Kantharaj et al., 2022). Such questions often ask for answers that require high-level reasoning and explanations. To improve the model's capability in answering open-ended questions, we follow previous work (Shi et al., 2022) to generate synthetic open-ended QA pairs. Specifically, a T5 model (Raffel et al., 2020) pretrained on SQuAD (Rajpurkar et al., 2016) is employed to generate an open-ended question for each summary. The sentence containing the answer in the summary then serves as the answer to its generated question.
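A hedged sketch of this generation step follows; the question-generation checkpoint shown is a public T5 model used for illustration, not necessarily the one in our pipeline:

```python
# Synthetic open-ended QA: generate a question whose answer is a
# sentence of the chart summary.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

ckpt = "mrm8488/t5-base-finetuned-question-generation-ap"  # assumption
tok = AutoTokenizer.from_pretrained(ckpt)
qg = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

summary = ("The trend over the years shows a decline in birth rates "
           "for all income levels.")
answer_sentence = summary  # the sentence containing the answer

x = tok(f"answer: {answer_sentence}  context: {summary}",
        return_tensors="pt")
question = tok.decode(qg.generate(**x, max_new_tokens=48)[0],
                      skip_special_tokens=True)
print(question, "->", answer_sentence)  # one synthetic QA pair
```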
• Chart Summarization Image captioning is a fundamental problem in AI in which a machine needs to summarize the main content of an image in textual form. This task has been studied extensively (Vinyals et al., 2015; Herdade et al., 2019; Hu et al., 2021; Li et al., 2022). We follow previous work (Vinyals et al., 2015; Xia et al., 2021) and pretrain our model on this task to further enhance its capability to generate textual descriptions from chart images. As discussed in §3.3, we mostly used summaries generated by the GPT models provided by OpenAI, either directly or through a knowledge distillation step.

Downstream Tasks
In addition to zero-shot evaluation, we also adapt UniChart by finetuning it on downstream tasks. We consider four downstream tasks: (1) Factoid Chart Question Answering: we use ChartQA (Masry et al., 2022), a benchmark consisting of factoid question-answer pairs over charts with a particular focus on visual and logical reasoning questions; (2) Complex Chart Question Answering: we consider OpenCQA (Kantharaj et al., 2022), another QA benchmark in which answers are explanatory descriptions; (3) Chart Summarization: we use Chart-to-Text (Shankar et al., 2022), a large-scale benchmark for chart summarization; (4) Chart-to-Table: we use ChartQA for both finetuning and evaluation. Moreover, we evaluate the pretrained model in a zero-shot setup on the WebCharts dataset (Choi et al., 2019), a collection of 300 charts obtained from the web.

Experiments Setup
To minimize the computational resource requirements, we initialize our model from the base Donut weights (Kim et al., 2022). Our pretraining process consists of two stages. In the first stage, we set the input image resolution to 512x512 and pretrain for 300K steps. In the second stage, we increase the input image resolution to 960x960 and pretrain for an additional 100K steps. Table 6 shows the hyperparameters we used in pretraining and finetuning our model on each downstream task. All our experiments were carried out on one 4-A100 (40GB), one 4-A100 (80GB), and one 4-V100 (32GB) GPU machine.
To evaluate our approach, we follow previous work (Lee et al., 2022; Shankar et al., 2022; Masry et al., 2022; Kantharaj et al., 2022; Liu et al., 2022b) and use Relaxed Accuracy (RA) for ChartQA and BLEU (Post, 2018) for the text-generation tasks (Chart-to-Text and OpenCQA). However, the BLEU score has limitations, as it primarily focuses on n-gram matching between the generated and reference texts, overlooking important factors such as semantic similarity, informativeness, and factual correctness (Goyal et al., 2022). Therefore, we conduct a human evaluation and a ChatGPT-driven study to assess and compare these crucial aspects in the outputs of different models (§5.3). Finally, we use the Relative Number Set Similarity (RNSS) metric (Masry et al., 2022) to evaluate the Chart-to-Table task.
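For reference, the Relaxed Accuracy criterion can be sketched as follows: a numeric prediction counts as correct if it is within 5% of the gold answer, while non-numeric answers require an exact (case-insensitive) match:

```python
def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """Relaxed Accuracy criterion used in ChartQA-style evaluation."""
    try:
        p, g = float(pred), float(gold)
        return p == g if g == 0 else abs(p - g) <= tol * abs(g)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()

print(relaxed_match("95.5", "95"))    # True: within 5% of 95
print(relaxed_match("Blue", "blue"))  # True: exact string match
```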

Main Results
As shown in Table 2, UniChart outperforms the previous state-of-the-art models, MatCha and VL-T5, on the ChartQA and Chart-to-Text (Pew) datasets, although it shows slightly lower performance on Chart-to-Text (Statista). The performance gap is more prominent on the challenging human-written questions in the ChartQA benchmark (Masry et al., 2022), where our model's pretraining objectives tailored to visual and numerical reasoning give it a significant advantage. UniChart also achieved a higher BLEU score than the SoTA VL-T5 model on the OpenCQA benchmark, which demonstrates our model's capability in generating explanatory answers for questions about charts. Finally, UniChart surpasses MatCha's performance on two datasets, demonstrating its generalizability across diverse visual styles, even in a zero-shot setup on unseen charts (WebCharts). Overall, these results establish UniChart as the SoTA model for chart comprehension and reasoning tasks.
To further assess the impact of our different pretraining objectives on our model's performance, we conducted ablation studies. We observe that removing various pretraining objectives led to a slight decrease in performance (Table 8). The decrease is particularly noticeable when the Numerical Reasoning pretraining task is removed, highlighting the importance of this task in imbuing numerical abilities into our model. More details of this experiment can be found in Appendix A.4.

Human and ChatGPT Evaluation
As discussed in §5.1, reference-based metrics like BLEU have relatively low correlations with human judgments (Belz and Reiter, 2006; Tan et al., 2015; Liu et al., 2023a), and generated texts with very high such scores can be of very poor quality (Smith et al., 2016). Therefore, we decided to conduct a human evaluation to measure the quality of summaries generated by different models. We focus on the following criteria in the chart summarization task: (1) Informativeness; (2) Factual Correctness; and (3) Semantic Levels that characterize the content of the summary. More details about the criteria can be found in Appendix A.5.
We randomly picked 150 sample charts from the Chart-to-Text Statista test split and asked three human annotators to rate four summaries for each chart for informativeness on a scale of 1 to 5. The order in which summaries were shown to the annotators was randomized to avoid any potential bias. Summaries for each chart were rated by one annotator, except for the first 100 charts, for which we had two annotators so we could measure agreement. We computed Krippendorff's alpha (Krippendorff, 2011) to measure inter-annotator agreement and found a moderate level of agreement, with an alpha coefficient of 0.54. We further utilized ChatGPT to evaluate the same 150 samples, as LLMs have demonstrated their effectiveness as evaluators for text generation tasks (Luo et al., 2023; Liu et al., 2023a; Gao et al., 2023; Fu et al., 2023). We define the informativeness criterion and rating scheme for ChatGPT and then employ ChatGPT to generate evaluation steps. We then send these evaluation steps along with the data table of the chart and the summary to ChatGPT to obtain ratings (see Appendix A.5 for details).
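The rating step can be sketched as below; the model name, prompt wording, and response handling are assumptions (the actual instructions and generated evaluation steps are reproduced in Appendix A.5):

```python
# Hedged sketch of the ChatGPT rating call (OpenAI Python SDK v1).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_summary(eval_steps: str, data_table: str, summary: str) -> str:
    prompt = (f"{eval_steps}\n\nChart data table:\n{data_table}\n\n"
              f"Summary:\n{summary}\n\n"
              "Rate informativeness (1-5) and provide feedback.")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```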
Table 3 shows the results of the human evaluation on chart summarization based on the informativeness criterion. We notice that annotators preferred the ZeroShot version of our model, which generates summaries more similar to those generated by GPT than to the gold summaries. The finetuned version of UniChart was also rated higher than the SoTA MatCha (Liu et al., 2022b). The finetuned UniChart model also produces fewer factual errors than MatCha and the ZeroShot version (Appendix A.5 and Table 7). We observe that the ratings provided by ChatGPT are roughly consistent with the human annotators' scores on the informativeness criterion. Moreover, we conducted a statistical test comparing the ratings from humans and ChatGPT, with the null hypothesis that the two sets of ratings are independent samples; the p-values in each row of Table 3 show that the observed ratings are very unlikely under this independence hypothesis. Also, in terms of different semantic contents, the ZeroShot model tends to produce more sentences with high-level visual patterns and trends. A previous study finds that such high-level insights lead to more reader takeaways than text describing low-level visual encodings like axes and colors (Stokes et al., 2022). Overall, the results above suggest that UniChart's summaries are more informative, with high-level insights, and more factually accurate than the SoTA (MatCha).

Time and Memory Efficiency
UniChart exhibits significant time efficiency compared to MatCha, as shown in Fig. 4. The gap in speed is more evident on tasks that require the generation of long output sequences (e.g., Chart-to-Text). This difference in speed can be attributed to MatCha's use of a long input sequence (4K) with a quadratic increase in complexity, while UniChart's vision encoder relies on sliding windows with a local attention mechanism that scales linearly with the input image size. Moreover, UniChart has a smaller parameter count (201M) than MatCha (282M), further contributing to its efficiency. As a result, UniChart is highly suitable for real-world applications that prioritize fast inference speeds. More details are provided in Appendix A.7.

Error Analysis and Challenges
We conducted a manual analysis of our model's outputs to identify key challenges faced by existing models.
• Densely populated charts: Our model struggles with extracting insights from chart images that contain numerous data elements densely packed in a limited area. This is evident in Fig. 9 (Q3), where our model generates a hallucinated summary due to the complexity of the chart. Increasing model parameters and input image resolution could potentially improve performance in these cases.
• Numerical reasoning: Despite efforts to incorporate mathematical skills, our model still encounters difficulties with complex arithmetic calculations (Q2 in Fig. 9). Addressing this challenge involves decoupling arithmetic calculations from reasoning steps by employing external program executors that perform the calculations using the equations generated by our model (Gao et al., 2022).
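As an illustration of the executor idea (in the spirit of Gao et al., 2022), the model would emit an arithmetic expression and a small sandboxed evaluator would compute the final answer; the evaluator below is a toy sketch, not part of our released system:

```python
# Safe evaluator for model-generated arithmetic expressions.
import ast
import operator as op

OPS = {ast.Add: op.add, ast.Sub: op.sub,
       ast.Mult: op.mul, ast.Div: op.truediv}

def safe_eval(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.Constant):  # numeric literal
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

print(safe_eval("(95 - 74) / 95"))  # e.g., a relative-decrease question
```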
• Factual correctness in generated summaries: Factual correctness still poses a challenge for autoregressive language models (Lin et al., 2022; OpenAI, 2022; Zhao et al., 2023). Although our finetuned UniChart model produced fewer factual errors than MatCha (see Table 7), it still generates some incorrect statements (see Q4 in Fig. 9). This issue can be attributed to factual errors in the pretraining captions generated by ChatGPT.

Conclusion
We present UniChart, a general-purpose pretrained model designed for a broad range of chart-related tasks. Our model incorporates chart-specific pretraining tasks and is trained on a large and diverse collection of charts and corresponding summaries collected opportunistically using LLMs. We conducted both human and ChatGPT evaluations to show the superiority of our method. While our model sets the state-of-the-art record on four different downstream tasks and shows improved time and memory efficiency, the evaluation also reveals opportunities for improvement. We believe that our model and pretraining data will be valuable resources for future research and encourage further exploration in this relatively new area.

Limitations
While UniChart exhibits state-of-the-art performance on several benchmarks, it has several limitations. Despite its remarkable abilities on the ChartQA dataset, the model still struggles to answer questions that involve compositional mathematical operations. Moreover, we have noticed that the model may hallucinate and produce factually incorrect statements on text generation tasks such as Chart-to-Text and OpenCQA.
Despite the generalizability of our model to unseen chart image styles (WebCharts), there's still a noticeable drop in performance compared to the tasks on which the model is finetuned (e.g., ChartQA). Hence, there's still a need for more generalizable chart models for the diverse charts on the Web. One direction is to enlarge our pretraining datasets by crawling millions of chart images from the Web. Since most charts on the Web do not provide high-quality captions or the underlying data table, self-supervised pretraining objectives are needed to benefit from these charts.
Due to the limited computing resources, we did not investigate the effect hyperparameter tuning might have on the performance on the different downstream tasks. Also, although we have noticed the convergence of UniChart at the end of the second-stage pretraining, we cannot confirm whether further pretraining may improve the performance of our model.

A.2 Summary Generation for Charts without Data Tables

For the PewResearch charts, the underlying data tables are not provided. However, we have observed that the underlying data values are written on the visual elements in most of these charts. Hence, we decided to use an OCR tool to extract the layout-preserving texts from the chart images and then feed them into ChatGPT to generate the summaries, as shown in Fig. 6. We found that ChatGPT is capable of understanding a chart from the OCR data.

A.3 Dataset Analysis
The linguistic characteristics of the textual elements vary across different datasets, with charts from PlotQA and PewResearch often having longer text elements (e.g., axis labels, legends, titles), while the augmented data and Beagle datasets contain shorter text (Table 4, right). In Table 5, we further provide linguistic statistics for the summaries of the datasets used in the summary generation task at pretraining.

A.4 Ablation study
To further assess the impact of our different pretraining objectives on our model's performance, we conducted ablation studies. Due to computational limitations, we focused on pretraining the model only at the lower image size (512x512) and compared it against the corresponding main model (512x512). From Table 8, we observe that removing the Chart Summarization or Open-ended Question Answering objectives led to a slight decrease in performance on ChartQA. We attribute this to the abundance of numerical reasoning examples in pretraining. However, removing the Numerical Reasoning pretraining task led to a substantial decrease in performance on ChartQA, highlighting the importance of this task in imbuing numerical abilities into our model. Pretraining the model without the Data Table Generation objective resulted in a relatively weak performance on the ChartQA benchmark, underscoring the importance of understanding the underlying data tables of charts in answering reasoning questions.

A.5 Human and ChatGPT Evaluation
As discussed in §5.3, we evaluate the following three criteria in the human evaluation study: (1) Informativeness, which measures how much information from the chart the summary covers; ideally, an informative summary should contain high-level insights from the chart, such as important patterns, trends, and outliers in data; (2) Factual Correctness, which measures whether the statements in the summary are factually supported by the chart data; and (3) Semantic Levels, which characterize the content of the summary (e.g., low-level visual encodings vs. high-level patterns and trends).

A sample summary from this evaluation reads: "The chart shows the percentage of U.S. teens, ages 13 to 17, who feel overwhelmed by drama, excluded by friends, pressured to post content for likes, or worse about their life due to social media. Girls are more likely than boys to feel overwhelmed by drama, excluded by friends, and worse about their life. Around 45% of girls report feeling overwhelmed by drama, while only 32% of boys feel the same. Similarly, 37% of girls report feeling excluded by their friends compared to 24% of boys. Regarding pressure to post content for likes, 32% of boys and 27% of girls feel this pressure. Finally, 28% of girls and 18% of boys report feeling worse about their own life due to social media. Overall, the chart suggests that social media has a greater negative impact on the mental health of teen girls than teen boys."

For the ChatGPT evaluation, we first give ChatGPT an initial instruction together with the evaluation criterion:

"You will be given one summary written for a chart. Your task is to rate the summary on one metric and provide appropriate feedback. Evaluation Criteria: Informativeness (1-5). It measures how informative the generated summary is. In other words, informativeness measures how much information from the data table of the chart the summary covers. Ideally, an informative summary should contain high-level insights from the chart, such as important patterns, trends and outliers in data."

ChatGPT then generates detailed evaluation steps: 1. Read the summary carefully and ensure understanding of the information presented in the chart data table. 2. Evaluate the informativeness of the summary by assessing how much information from the data table is covered. 3. Rate the informativeness of the summary on a scale of 1 to 5, with 1 being not informative at all and 5 being highly informative and containing all key insights. 4. Consider whether the summary effectively conveys the key information from the chart data table and is factually accurate. 5. Provide specific feedback on areas for improvement, if necessary.

For example, given the summary "This statistic shows the economic contribution of the travel and tourism industry to the United Kingdom (UK) in 2017, by sector. The direct contribution to GDP is approximately 93.5 billion U.S. dollars.", ChatGPT returned: "Informativeness: 2/5. Feedback: The summary only covers one data point from the table, which is the direct contribution to GDP. It does not provide any information on the total contribution to GDP, visitor exports, domestic spending, leisure spending, business spending, or capital investment. The summary is not informative enough as it does not contain any key insights from the chart data table. The summary needs to include more information from the table in order to be more informative."

Figure 1: Our UniChart model with different pretraining objectives. The model consists of two main modules: a Chart Image Encoder and a Text Decoder. Four different pretraining objectives are specified in different colors: data table generation, chart summarization, numerical and visual reasoning, and open-ended question answering.

Figure 3: An example of InstructGPT's performance in generating summaries for data tables. On the left, the red text is a full example of a demonstration and its summary, followed by the demonstration for the target chart. The paragraph in green shows the summary generated by the model.

Figure 4: Average inference time for 10 random samples from three major benchmarks in the chart understanding domain, for the UniChart and MatCha models.

Figure 5: Examples of summaries generated by the Flan-T5 XL model after finetuning.

Figure 6: An example of the layout-preserved OCR-extracted text for a PewResearch chart image where the underlying data table is not available. The extracted text is then given to ChatGPT to generate a summary. ChatGPT can still extract and comprehend important information and insights from the layout-preserving text of the chart image.

Figure 7: The pipeline designed for the ChatGPT evaluation experiment. First, we feed the task description followed by our desired criteria into ChatGPT in order to get detailed grading instructions. Then, the chart (underlying data table representation) and a sample summary are appended to the prompt, which is fed again into ChatGPT to receive the feedback.

Table 1: Number of examples for each task in pretraining.

Table 2: Evaluation results on four public benchmarks: ChartQA, Chart-to-Text, OpenCQA, and Chart-to-Table. All results are computed after finetuning the pretrained UniChart checkpoint, except for WebCharts (zero-shot).

Table 4: Chart type distribution and linguistic statistics of the chart pretraining corpus. The charts in the last group (magenta) do not come with an underlying data table. The charts generated by the data augmentation process are shown in blue.

Table 5: Statistics about the captions of the datasets used in the summary generation task at pretraining.