ROSCO-Omni: Multimodal LLM-Based Communication Understanding for Non- and Minimally-Speaking Autistic Individuals

Siddhant Bikram Shah; Kristina T. Johnson

doi:10.18653/v1/2026.findings-acl.2011

Abstract

Approximately 30% of autistic individuals remain non- or minimally-speaking throughout their lives, yet communicate richly through gestures, vocalizations, facial expressions, and augmentative devices. Interpreting this communication is an inherently multimodal task: caregivers rely on the simultaneous integration of visual cues, auditory signals, and contextual understanding to infer intent. Despite this natural alignment with multimodal large language models (MLLMs), research at this intersection remains narrowly focused on diagnosis rather than communication understanding. We address this gap by reframing the problem around two complementary dimensions: communicative actions (the physical modality) and communicative functions (the pragmatic intent). We analyze the ROSCO dataset, containing 2,903 caregiver-annotated video samples from 27 non- and minimally-speaking individuals, with multi-label annotations capturing up to three concurrent actions and two functions per sample across 6 action and 6 function classes. We further propose ROSCO-Omni, a teacher-student distillation framework that generates label-guided instruction data from a high-capability teacher MLLM and uses it to finetune a student MLLM for domain-specialized inference. ROSCO-Omni achieves performance comparable to closed-source models, demonstrating that open-source MLLMs can be adapted to understand communication in this underserved population.
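
The multi-label annotation scheme described in the abstract can be illustrated with a minimal sketch. Only the class counts (6 action, 6 function) and the per-sample caps (up to three actions, up to two functions) come from the abstract; the `Sample` type, the `multi_hot` helper, and the use of integer class indices in place of the dataset's actual class names are assumptions made here for illustration and may not match the authors' data format.

```python
from dataclasses import dataclass

# Counts and caps taken from the abstract.
NUM_ACTION_CLASSES = 6
NUM_FUNCTION_CLASSES = 6
MAX_ACTIONS = 3     # up to three concurrent actions per sample
MAX_FUNCTIONS = 2   # up to two functions per sample


def multi_hot(indices, num_classes, max_labels):
    """Encode class indices as a 0/1 vector, enforcing the per-sample label cap."""
    unique = sorted(set(indices))
    if len(unique) > max_labels:
        raise ValueError(f"expected at most {max_labels} labels, got {len(unique)}")
    if any(i < 0 or i >= num_classes for i in unique):
        raise ValueError("class index out of range")
    vec = [0] * num_classes
    for i in unique:
        vec[i] = 1
    return vec


@dataclass
class Sample:
    """Hypothetical container: integer indices stand in for real class names."""
    actions: list
    functions: list

    def encode(self):
        # Returns (action_vector, function_vector), each of length 6.
        return (
            multi_hot(self.actions, NUM_ACTION_CLASSES, MAX_ACTIONS),
            multi_hot(self.functions, NUM_FUNCTION_CLASSES, MAX_FUNCTIONS),
        )
```

For example, `Sample(actions=[0, 2], functions=[5]).encode()` yields one six-element vector per dimension, which is a common target format for multi-label classification.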

Anthology ID: 2026.findings-acl.2011
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 40453–40469
URL: https://aclanthology.org/2026.findings-acl.2011/
DOI: 10.18653/v1/2026.findings-acl.2011
Cite (ACL): Siddhant Bikram Shah and Kristina T. Johnson. 2026. ROSCO-Omni: Multimodal LLM-Based Communication Understanding for Non- and Minimally-Speaking Autistic Individuals. In Findings of the Association for Computational Linguistics: ACL 2026, pages 40453–40469, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): ROSCO-Omni: Multimodal LLM-Based Communication Understanding for Non- and Minimally-Speaking Autistic Individuals (Shah & Johnson, Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.2011.pdf
Checklist: 2026.findings-acl.2011.checklist.pdf
