Tushar Nagarajan
2024
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Seungwhan Moon | Andrea Madotto | Zhaojiang Lin | Tushar Nagarajan | Matt Smith | Shashank Jain | Chun-Fu Yeh | Prakash Murugesan | Peyman Heidari | Yue Liu | Kavya Srinet | Babak Damavandi | Anuj Kumar
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e., text, image, video, audio, IMU motion sensor) and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs, including Llama-3 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. In this paper, we provide details on the optimizations implemented to efficiently scale the training pipeline, and present a comprehensive recipe for model and training configurations. We conduct a comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks compared to industry-leading models – albeit with a relatively small number of trainable parameters.
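As a rough illustration of the aligner idea described in the abstract, the sketch below shows one common way a frozen modality encoder's features can be projected into an LLM's token-embedding space so that image, audio, or IMU signals are consumed as pseudo-text tokens. This is a minimal sketch under stated assumptions: the class name, dimensions, resampler-style attention, and the prepend-to-text-embeddings usage are illustrative choices, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ModalityAligner(nn.Module):
    """Hypothetical aligner: maps a frozen modality encoder's output
    into the LLM's embedding space as a fixed-length token sequence.
    Names and sizes are illustrative, not the paper's configuration."""

    def __init__(self, encoder_dim: int, llm_dim: int, num_query_tokens: int = 32):
        super().__init__()
        # Learnable query tokens summarize variable-length modality features
        # into a fixed number of pseudo-text tokens.
        self.queries = nn.Parameter(torch.randn(num_query_tokens, encoder_dim))
        self.attn = nn.MultiheadAttention(encoder_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, modality_features: torch.Tensor) -> torch.Tensor:
        # modality_features: (batch, seq_len, encoder_dim) from a frozen encoder
        batch = modality_features.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        summarized, _ = self.attn(q, modality_features, modality_features)
        # Project into the LLM embedding space; the result would be prepended
        # to the text-token embeddings before the language model is run.
        return self.proj(summarized)


# Example: project hypothetical image-encoder features (dim 1024) into a
# hypothetical LLM embedding space (dim 8192).
aligner = ModalityAligner(encoder_dim=1024, llm_dim=8192)
image_features = torch.randn(2, 256, 1024)
pseudo_tokens = aligner(image_features)  # shape: (2, 32, 8192)
```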