Chart2Code53: A Large-Scale Diverse and Complex Dataset for Enhancing Chart-to-Code Generation

Tianhao Niu; Yiming Cui; Baoxin Wang; Xiao Xu; Xin Yao; Qingfu Zhu; Dayong Wu; Shijin Wang; Wanxiang Che

doi:10.18653/v1/2025.emnlp-main.799

Abstract

Chart2code has recently received significant attention in the multimodal community due to its potential to reduce the burden of visualization and promote a more detailed understanding of charts. However, existing Chart2code training datasets suffer from at least one of the following issues: (1) limited scale, (2) limited type coverage, and (3) inadequate complexity. To address these challenges, we seek more diverse sources that better align with real-world user distributions and propose dual data synthesis pipelines: (1) synthesis based on online plotting code, and (2) synthesis based on chart images in academic papers. Using these pipelines, we create Chart2Code53, a large-scale Chart2code training dataset comprising 53 chart types and 130K chart-code pairs. Experimental results demonstrate that, even with few parameters, a model fine-tuned on Chart2Code53 achieves state-of-the-art performance among open-source models on multiple Chart2code benchmarks.

Anthology ID: 2025.emnlp-main.799
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 15828–15844
URL: https://aclanthology.org/2025.emnlp-main.799/
DOI: 10.18653/v1/2025.emnlp-main.799
Cite (ACL): Tianhao Niu, Yiming Cui, Baoxin Wang, Xiao Xu, Xin Yao, Qingfu Zhu, Dayong Wu, Shijin Wang, and Wanxiang Che. 2025. Chart2Code53: A Large-Scale Diverse and Complex Dataset for Enhancing Chart-to-Code Generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15828–15844, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Chart2Code53: A Large-Scale Diverse and Complex Dataset for Enhancing Chart-to-Code Generation (Niu et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.799.pdf
Checklist: 2025.emnlp-main.799.checklist.pdf
