2025
Train a Unified Multimodal Data Quality Classifier with Synthetic Data
Weizhi Wang | Rongmei Lin | Shiyang Li | Colin Lockard | Ritesh Sarkhel | Sanket Lokegaonkar | Jingbo Shang | Xifeng Yan | Nasser Zalmout | Xian Li
Findings of the Association for Computational Linguistics: EMNLP 2025
Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, yet high-quality data filtering for image-text interleaved document data remains under-explored. We propose to train an efficient MLLM as a Unified Multimodal Data Quality Classifier to Filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training.
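As a rough illustration of the filtering stage described in the abstract, the Python sketch below scores multimodal samples with a stand-in classifier and keeps those above a quality threshold. The `MultimodalSample` structure, the `score_sample` heuristic, and the 0.7 threshold are assumptions for illustration only, not the trained UniFilter model or its API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MultimodalSample:
    """A caption or interleaved document sample: image references plus text."""
    image_paths: List[str]
    text: str


def score_sample(sample: MultimodalSample) -> float:
    """Stand-in for a trained quality classifier.

    The real model would embed the images and text with an efficient MLLM and
    map them to a quality score; here a trivial text-length heuristic is used
    purely so the sketch runs end to end.
    """
    return min(len(sample.text) / 500.0, 1.0)


def filter_corpus(
    samples: List[MultimodalSample],
    scorer: Callable[[MultimodalSample], float],
    threshold: float = 0.7,
) -> List[MultimodalSample]:
    """Keep only samples whose predicted quality meets the threshold."""
    return [s for s in samples if scorer(s) >= threshold]


if __name__ == "__main__":
    corpus = [
        MultimodalSample(["img_001.jpg"], "A short noisy caption."),
        MultimodalSample(
            ["img_002.jpg"],
            "A detailed paragraph describing the scene, its objects, "
            "and their relations. " * 10,
        ),
    ]
    kept = filter_corpus(corpus, score_sample, threshold=0.7)
    print(f"Kept {len(kept)} of {len(corpus)} samples")
```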
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Yuchen Zhuang | Jingfeng Yang | Haoming Jiang | Xin Liu | Kewei Cheng | Sanket Lokegaonkar | Yifan Gao | Qing Ping | Tianyi Liu | Binxuan Huang | Zheng Li | Zhengyang Wang | Pei Chen | Ruijie Wang | Rongzhi Zhang | Nasser Zalmout | Priyanka Nigam | Bing Yin | Chao Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Due to the scarcity of agent-oriented pre-training data, LLM-based autonomous agents typically rely on complex prompting or extensive fine-tuning, which often fails to introduce new capabilities while preserving strong generalizability. We introduce Hephaestus-Forge, the first large-scale pre-training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Hephaestus-Forge comprises 103B tokens of agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function-calling trajectories to strengthen intrinsic reasoning. To explore effective training protocols, we investigate scaling laws to identify the optimal recipe for data mixing ratios. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of our pre-training corpus in enhancing fundamental agentic capabilities and the generalization of LLMs to new tasks or environments.
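To make the data-mixing idea concrete, here is a minimal Python sketch that assembles a continual pre-training batch from agent-specific and general-text pools at fixed mixing ratios. The pool names, the example ratios, and the `mix_pretraining_batch` helper are illustrative assumptions; the paper selects its actual recipe by fitting scaling laws, which this sketch does not do.

```python
import random
from typing import Dict, List


def mix_pretraining_batch(
    pools: Dict[str, List[str]],
    ratios: Dict[str, float],
    batch_size: int,
    seed: int = 0,
) -> List[str]:
    """Draw a batch whose composition follows the given mixing ratios.

    `pools` maps a data-source name (e.g. tool documentation, function-calling
    trajectories, general web text) to its documents; `ratios` gives the share
    of the batch each source should contribute and must sum to 1.
    """
    assert abs(sum(ratios.values()) - 1.0) < 1e-6, "ratios must sum to 1"
    rng = random.Random(seed)
    batch: List[str] = []
    for name, share in ratios.items():
        k = round(share * batch_size)
        batch.extend(rng.choices(pools[name], k=k))
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    pools = {
        "tool_docs": [f"API doc {i}" for i in range(100)],
        "trajectories": [f"function-calling trajectory {i}" for i in range(100)],
        "general_text": [f"web document {i}" for i in range(100)],
    }
    # Illustrative ratios only; the optimal mixture would come from scaling-law fits.
    ratios = {"tool_docs": 0.3, "trajectories": 0.3, "general_text": 0.4}
    print(mix_pretraining_batch(pools, ratios, batch_size=10))
```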
DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities
Jing Yang Lee | Hamed Bonab | Nasser Zalmout | Ming Zeng | Sanket Lokegaonkar | Colin Lockard | Binxuan Huang | Ritesh Sarkhel | Haodong Wang
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between the required capabilities and the training paradigm. We introduce a novel approach that addresses this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training yields up to a 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at https://huggingface.co/datasets/AmazonScience/DocTalk.
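As a loose sketch of the document-to-dialogue pipeline described above, the Python example below groups documents into clusters via connected components of their hyperlink graph and turns each cluster into alternating question-answer turns. The `cluster_by_links` and `cluster_to_dialogue` helpers are hypothetical simplifications; the actual DocTalk pipeline uses an LLM to generate natural multi-turn, multi-topic, information-seeking dialogues.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_by_links(doc_links: Dict[str, List[str]]) -> List[List[str]]:
    """Group documents into connected components of their hyperlink graph."""
    adj: Dict[str, set] = defaultdict(set)
    for doc, links in doc_links.items():
        for target in links:
            if target in doc_links:  # only keep links within the corpus
                adj[doc].add(target)
                adj[target].add(doc)
    seen, clusters = set(), []
    for doc in doc_links:
        if doc in seen:
            continue
        stack, component = [doc], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(adj[node] - seen)
        clusters.append(component)
    return clusters


def cluster_to_dialogue(cluster: List[str], texts: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn a document cluster into alternating (user, assistant) turns.

    A real pipeline would prompt an LLM to write natural information-seeking
    questions and grounded answers; this stub pairs a templated question with
    the document text so the overall flow is visible.
    """
    return [(f"Can you tell me about {title}?", texts[title]) for title in cluster]


if __name__ == "__main__":
    texts = {
        "Mars": "Mars is the fourth planet from the Sun...",
        "Phobos": "Phobos is the larger moon of Mars...",
        "Tea": "Tea is an aromatic beverage...",
    }
    links = {"Mars": ["Phobos"], "Phobos": ["Mars"], "Tea": []}
    for cluster in cluster_by_links(links):
        for user_turn, assistant_turn in cluster_to_dialogue(cluster, texts):
            print("USER:", user_turn)
            print("ASSISTANT:", assistant_turn)
```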