Ahmad Nickabadi


2025

Diversity is the Key: Enhancing LLM-based Post-processing for Automated Audio Captioning
Seyed Ali Farokh | Mohammad Mehdi Homayounpour | Ahmad Nickabadi
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)

Automated Audio Captioning (AAC) is a multimodal task aimed at generating natural language descriptions of audio content. Previous studies have shown that LLMs can improve AAC performance by summarizing audio events based on a list of candidate captions, which are selected by an external reranker from those generated using Nucleus Sampling. However, the reranking process often selects overly similar captions, disregarding the original diversity of the sampled captions. In this work, we show that this diversity reflects the AAC model’s level of certainty and propose a lightweight candidate selection approach that preserves the initial diversity of the generated captions. This, in turn, enables an LLM to summarize the captions while considering the AAC model’s certainty in a few-shot setting. Experimental results demonstrate that our method outperforms previous post-processing techniques while being significantly faster.
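
The abstract ties the diversity of sampled captions to the AAC model's certainty but does not say how diversity is measured. As a rough illustration only, the sketch below assumes one simple proxy: the mean pairwise Jaccard distance between the word sets of the candidate captions. The function names `jaccard_distance` and `caption_diversity` are hypothetical and are not taken from the paper; the authors' actual measure and selection procedure may differ.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between the lowercase word sets of two captions."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        # Two empty captions are treated as identical.
        return 0.0
    overlap = len(tokens_a & tokens_b)
    union = len(tokens_a | tokens_b)
    return 1.0 - overlap / union


def caption_diversity(captions: list[str]) -> float:
    """Mean pairwise Jaccard distance over all caption pairs.

    Assumed proxy for diversity (not the paper's definition):
    0.0 means every caption uses the same words, 1.0 means no two
    captions share a word.
    """
    pairs = list(combinations(captions, 2))
    if not pairs:
        # Fewer than two captions: no pairwise diversity to measure.
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical candidate lists, invented for illustration.
    agreeing = ["a dog barks loudly", "a dog barks loudly", "a dog barks"]
    disagreeing = ["a dog barks", "rain falls on a roof", "an engine idles"]
    print(caption_diversity(agreeing))
    print(caption_diversity(disagreeing))
```

Under this assumed proxy, a low score would suggest the sampled captions largely agree, and a higher score would suggest the model is less certain about the audio content. A reranker that keeps only near-duplicate captions would push the score toward zero regardless of how varied the original samples were.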