MIDI-PHOR: Multi-View Distillation for Music Understanding and Captioning

Steven Au


Abstract
Text-only training is a promising method for training multimodal machine learning models without data from every modality. However, few studies have explored its use as an approximation of missing data for supervised learning in data-scarce settings. In this work, we examine techniques for acquiring text-based training data, address the modality gap, and present a case study on classifying subjective audio timbre descriptions using three kinds of text-only training data and six augmentation methods across eight audio-timbre datasets. We find that text-only training yields supervised audio classifiers, trained without any audio, that are competitive with a zero-shot baseline and with training on real audio.
Anthology ID:
2026.nlp4musa-1.6
Volume:
Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Elena V. Epure, Sergio Oramas, SeungHeon Doh, Pedro Ramoneda, Anna Kruspe, Mohamed Sordo
Venues:
NLP4MusA | WS
Publisher:
Association for Computational Linguistics
Pages:
33–43
URL:
https://aclanthology.org/2026.nlp4musa-1.6/
Cite (ACL):
Steven Au. 2026. MIDI-PHOR: Multi-View Distillation for Music Understanding and Captioning. In Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026), pages 33–43, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
MIDI-PHOR: Multi-View Distillation for Music Understanding and Captioning (Au, NLP4MusA 2026)
PDF:
https://aclanthology.org/2026.nlp4musa-1.6.pdf