Steven Au


2026

A central limitation of current music understanding frameworks is their reliance on audio embeddings, which often yield interpretations with no traceable ties to explicit musical elements such as notes, dynamics, and instrumentation. We address this gap with MIDIPHOR, a MIDI-first framework that converts symbolic data into structured, queryable representations for reasoning. MIDIPHOR distills each piece into three complementary views: a symbolic view capturing pitch, meter, and key; a time-series view that tracks rhythmic salience, texture, and role activity; and an instrument-role graph encoding ensemble interactions. Because every claim is linked to evidence, our experiments show fewer hallucinations than raw-MIDI baselines, yielding a robust, auditable bridge between symbolic data and semantic music understanding.
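The abstract does not specify an API, but as a rough illustration the three views might be bundled as in the following minimal Python sketch; all class and field names here are assumptions for illustration, not MIDIPHOR's actual interface:

```python
# Hypothetical sketch of MIDIPHOR's three complementary views.
# Every name below is illustrative, not the framework's real API.
from dataclasses import dataclass, field


@dataclass
class SymbolicView:
    """Pitch, meter, and key information distilled from the MIDI file."""
    key: str                                           # e.g. "D minor"
    time_signature: str                                # e.g. "3/4"
    pitches: list[int] = field(default_factory=list)   # MIDI note numbers


@dataclass
class TimeSeriesView:
    """Per-beat series tracking rhythmic salience, texture, and role activity."""
    rhythmic_salience: list[float] = field(default_factory=list)
    texture_density: list[float] = field(default_factory=list)
    role_activity: dict[str, list[float]] = field(default_factory=dict)


@dataclass
class InstrumentRoleGraph:
    """Nodes are instrument roles; weighted edges encode ensemble interaction."""
    roles: list[str] = field(default_factory=list)
    edges: dict[tuple[str, str], float] = field(default_factory=dict)


@dataclass
class PieceRepresentation:
    """The structured, queryable bundle a reasoner receives for one piece."""
    symbolic: SymbolicView
    time_series: TimeSeriesView
    role_graph: InstrumentRoleGraph
```

Under this kind of representation, a reasoner can cite concrete fields (a specific pitch, a beat index, a graph edge) so that each interpretive claim stays traceable to symbolic evidence.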

2024

We describe SemEval-2024 Task 10 (EDiReF), which consists of three subtasks involving emotion in conversations across Hinglish code-mixed and English datasets. The subtasks cover classifying speaker emotion in multiparty conversations (Emotion Recognition in Conversation) and reasoning about shifts in a speaker's emotional state (Emotion Flip Reasoning). We deployed a BERT-based model for emotion recognition and two GRU-based models for emotion flip reasoning. Our models achieved F1 scores of 0.45, 0.79, and 0.68 on subtasks 1, 2, and 3, respectively.
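As a rough illustration of the emotion recognition component, here is a minimal Hugging Face sketch. The bert-base-multilingual-cased checkpoint and the seven-emotion label set are assumptions rather than the exact system, and a real submission would first fine-tune the classifier on the task data:

```python
# Minimal sketch of a BERT utterance-level emotion classifier.
# Checkpoint and label set are assumptions, not the actual submission.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

# A multilingual checkpoint is a plausible choice for Hinglish code-mixed text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(EMOTIONS)
)
# Note: the classification head is randomly initialized here; predictions are
# only meaningful after fine-tuning on the conversation dataset.

def predict_emotion(utterance: str) -> str:
    """Classify the emotion of a single utterance."""
    inputs = tokenizer(utterance, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return EMOTIONS[int(logits.argmax(dim=-1))]

print(predict_emotion("Yaar, that was such a great surprise!"))
```

The flip-reasoning subtasks would sit on top of such per-utterance predictions, with the GRU-based models consuming the sequence of emotion states to detect where and why a speaker's emotion flips.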