VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making

Zuojin Tang; Bin Hu; Chenyang Zhao; De Ma; Gang Pan; Bin Liu

doi:10.18653/v1/2025.emnlp-main.468

Abstract

Recent large pretrained models such as LLMs (e.g., GPT series) and VLAs (e.g., OpenVLA) have achieved notable progress on multimodal tasks, yet they are built upon a multi-input single-output (MISO) paradigm. We show that this paradigm fundamentally limits performance in multi-input multi-output (MIMO) scenarios, where parallel task execution is required. In MISO architectures, tasks compete for a shared output channel, creating mutual exclusion effects that cause unbalanced optimization and degraded performance. To address this gap, we introduce MIMO-VLA (VLASCD), a unified training framework that enables concurrent multi-task outputs, exemplified by simultaneous dialogue generation and decision-making. Inspired by human cognition, MIMO-VLA eliminates interference between tasks and supports efficient parallel processing. Experiments on the CARLA autonomous driving platform demonstrate that MIMO-VLA substantially outperforms state-of-the-art MISO-based LLMs, reinforcement learning models, and VLAs in MIMO settings, establishing a new direction for multimodal and multitask learning.

Anthology ID: 2025.emnlp-main.468
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 9212–9232
URL: https://aclanthology.org/2025.emnlp-main.468/
DOI: 10.18653/v1/2025.emnlp-main.468
Cite (ACL): Zuojin Tang, Bin Hu, Chenyang Zhao, De Ma, Gang Pan, and Bin Liu. 2025. VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9212–9232, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making (Tang et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.468.pdf
Checklist: 2025.emnlp-main.468.checklist.pdf
