InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
Xiaotian Han | Yiren Jian | Xuefeng Hu | Haogeng Liu | Yiqi Wang | Qihang Fan | Yuang Ai | Huaibo Huang | Ran He | Zhenheng Yang | Quanzeng You
Findings of the Association for Computational Linguistics: EMNLP 2025
Pre-training on large, high-quality datasets is essential for improving the reasoning abilities of Large Language Models (LLMs), particularly in specialized fields like mathematics. However, the field of Multimodal LLMs (MLLMs) lacks a comprehensive, open-source dataset for mathematical reasoning. To fill this gap, we present InfiMM-WebMath-40B, a high-quality dataset of interleaved image-text documents. It consists of 24 million web pages, 85 million image URLs, and 40 billion text tokens, all carefully extracted and filtered from CommonCrawl. We outline our data collection and processing pipeline in detail. Models trained on InfiMM-WebMath-40B demonstrate strong performance in both text-only and multimodal settings, setting a new state-of-the-art on multimodal math benchmarks such as MathVerse and We-Math.
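To give a flavor of the extraction-and-filtering stage the abstract describes, here is a minimal sketch of a math-content filter over interleaved image-text documents. The `WebDoc` record, the regex heuristics, and the `min_hits` threshold are all illustrative assumptions; the paper's actual pipeline, detailed by the authors, uses more elaborate filtering than these toy rules.

```python
import re
from dataclasses import dataclass, field

# Hypothetical interleaved document record; field names are assumptions,
# not the paper's actual schema.
@dataclass
class WebDoc:
    url: str
    text: str
    image_urls: list = field(default_factory=list)

# Illustrative math-content signals; the real pipeline's filters are
# considerably more sophisticated than these heuristics.
MATH_PATTERNS = [
    re.compile(r"\$[^$]+\$"),                      # inline LaTeX
    re.compile(r"\\(frac|sum|int|sqrt|begin\{)"),  # common LaTeX commands
    re.compile(r"\b(theorem|lemma|proof|equation)\b", re.I),
]

def looks_mathematical(doc: WebDoc, min_hits: int = 2) -> bool:
    """Keep a page only if enough distinct math signals fire."""
    hits = sum(1 for p in MATH_PATTERNS if p.search(doc.text))
    return hits >= min_hits

def filter_corpus(docs):
    """Yield interleaved image-text documents that pass the math filter."""
    for doc in docs:
        if looks_mathematical(doc):
            yield doc

if __name__ == "__main__":
    sample = [
        WebDoc("https://example.com/calc",
               r"Proof: by the theorem, $\int_0^1 x\,dx = \frac{1}{2}$.",
               ["https://example.com/fig1.png"]),
        WebDoc("https://example.com/news", "Local team wins the game."),
    ]
    kept = list(filter_corpus(sample))
    print([d.url for d in kept])  # only the math-heavy page survives
```

At web scale, a cheap rule-based pass like this would typically serve only as a first cut before stronger model-based filtering, which is consistent with the careful multi-stage extraction the abstract emphasizes.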