Beyond Transcripts: A Renewed Perspective on Audio Chaptering

Fabian Retkowski; Maike Züfle; Thai-Binh Nguyen; Jan Niehues; Alex Waibel

Abstract

Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison of text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs (MLLMs); (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and current MLLMs struggle due to context limitations and weak instruction following.

Anthology ID: 2026.acl-long.396
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 8765–8787
URL: https://aclanthology.org/2026.acl-long.396/
Cite (ACL): Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, and Alexander Waibel. 2026. Beyond Transcripts: A Renewed Perspective on Audio Chaptering. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8765–8787, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Beyond Transcripts: A Renewed Perspective on Audio Chaptering (Retkowski et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.396.pdf
Checklist: 2026.acl-long.396.checklist.pdf