MIDI-PHOR: Multi-View Distillation for Music Understanding and Captioning

Steven Au


Abstract
Text-only training is a promising method for training multimodal machine learning models without data from every modality. However, few studies have explored its use as an approximation of missing data for supervised learning in data-scarce settings. In this work, we examine techniques for acquiring text-based training data, address the modality gap, and present a case study on classifying subjective audio timbre descriptions using three kinds of text-only training data and six augmentation methods across eight audio-timbre datasets. We find that text-only training yields supervised audio classifiers, trained without any audio, that are competitive with a zero-shot baseline and with training on real audio.
Anthology ID:
2026.nlp4musa-1.6
Volume:
Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Elena V. Epure, Sergio Oramas, SeungHeon Doh, Pedro Ramoneda, Anna Kruspe, Mohamed Sordo
Venues:
NLP4MusA | WS
Publisher:
Association for Computational Linguistics
Pages:
33–43
URL:
https://aclanthology.org/2026.nlp4musa-1.6/
Cite (ACL):
Steven Au. 2026. MIDI-PHOR: Multi-View Distillation for Music Understanding and Captioning. In Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026), pages 33–43, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
MIDI-PHOR: Multi-View Distillation for Music Understanding and Captioning (Au, NLP4MusA 2026)
PDF:
https://aclanthology.org/2026.nlp4musa-1.6.pdf