AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

Seungwhan Moon; Andrea Madotto; Zhaojiang Lin; Tushar Nagarajan; Matt Smith; Shashank Jain; Chun-Fu Yeh; Prakash Murugesan; Peyman Heidari; Yue Liu; Kavya Srinet; Babak Damavandi; Anuj Kumar

doi:10.18653/v1/2024.emnlp-industry.98

Abstract

We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e., text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs including Llama-3 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. In this paper, we provide details on the optimizations implemented to efficiently scale the training pipeline, and present a comprehensive recipe for model and training configurations. We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks compared to industry-leading models – albeit with a relatively small number of trainable parameters.

Anthology ID: 2024.emnlp-industry.98
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month: November
Year: 2024
Address: Miami, Florida, US
Editors: Franck Dernoncourt, Daniel Preoţiuc-Pietro, Anastasia Shimorina
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 1314–1332
URL: https://aclanthology.org/2024.emnlp-industry.98/
DOI: 10.18653/v1/2024.emnlp-industry.98
Cite (ACL): Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi, and Anuj Kumar. 2024. AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1314–1332, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal): AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model (Moon et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-industry.98.pdf