Diversity is the Key: Enhancing LLM-based Post-processing for Automated Audio Captioning

Seyed Ali Farokh; Mohammad Mehdi Homayounpour; Ahmad Nickabadi

Abstract

Automated Audio Captioning (AAC) is a multimodal task aimed at generating natural language descriptions of audio content. Previous studies have shown that LLMs can improve AAC performance by summarizing audio events based on a list of candidate captions, which are selected by an external reranker from those generated using Nucleus Sampling. However, the reranking process often selects overly similar captions, disregarding the original diversity of the sampled captions. In this work, we show that this diversity reflects the AAC model’s level of certainty and propose a lightweight candidate selection approach that preserves the initial diversity of the generated captions. This, in turn, enables an LLM to summarize the captions while considering the AAC model’s certainty in a few-shot setting. Experimental results demonstrate that our method outperforms previous post-processing techniques while being significantly faster.

Anthology ID: 2025.rocling-main.10
Volume: Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
Month: November
Year: 2025
Address: National Taiwan University, Taipei City, Taiwan
Editors: Kai-Wei Chang, Ke-Han Lu, Chih-Kai Yang, Zhi-Rui Tam, Wen-Yu Chang, Chung-Che Wang
Venue: ROCLING
Publisher: Association for Computational Linguistics
Pages: 87–94
URL: https://aclanthology.org/2025.rocling-main.10/
Cite (ACL): Seyed Ali Farokh, Mohammad Mehdi Homayounpour, and Ahmad Nickabadi. 2025. Diversity is the Key: Enhancing LLM-based Post-processing for Automated Audio Captioning. In Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025), pages 87–94, National Taiwan University, Taipei City, Taiwan. Association for Computational Linguistics.
Cite (Informal): Diversity is the Key: Enhancing LLM-based Post-processing for Automated Audio Captioning (Farokh et al., ROCLING 2025)
PDF: https://aclanthology.org/2025.rocling-main.10.pdf
