Chun-Fu Yeh


2024

AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Seungwhan Moon | Andrea Madotto | Zhaojiang Lin | Tushar Nagarajan | Matt Smith | Shashank Jain | Chun-Fu Yeh | Prakash Murugesan | Peyman Heidari | Yue Liu | Kavya Srinet | Babak Damavandi | Anuj Kumar
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e., text, image, video, audio, IMU motion sensor) and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs, including Llama-3 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. In this paper, we provide details on the optimizations implemented to efficiently scale the training pipeline, and present a comprehensive recipe for model and training configurations. We conduct an extensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks compared to industry-leading models, albeit with a relatively small number of trainable parameters.