X-ACE: Explainable and Multi-factor Audio Captioning Evaluation

Qian Wang; Jia-Chen Gu; Zhen-Hua Ling

doi:10.18653/v1/2024.findings-acl.729

Abstract

Automated audio captioning (AAC) aims to generate descriptions of audio input, which has prompted exploration of emerging audio language models (ALMs). However, current evaluation metrics provide only a single score for the overall quality of a caption; they do not characterize nuanced differences by systematically working through an evaluation checklist. To this end, we propose the explainable and multi-factor audio captioning evaluation (X-ACE) paradigm. X-ACE identifies four main factors that constitute the majority of audio features: sound event, source, attribute, and relation. To assess a caption from an ALM, the caption is first transformed into an audio graph, where each node denotes an entity in the caption and corresponds to a factor. On the one hand, graph matching is conducted from part to whole for a holistic assessment. On the other hand, the nodes within each factor are aggregated to measure factor-level performance. X-ACE thus explicitly demonstrates the strengths and weaknesses of an ALM, pointing out directions for further improvement. Experiments show that X-ACE correlates better with human perception and detects mismatches sensitively.
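
The factor-level idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how captions are parsed into graphs or how nodes are matched, so the sketch assumes the nodes are already extracted as (factor, text) pairs and uses exact string matching with precision, recall, and F1. The names `FACTORS`, `prf`, and `evaluate`, and the example nodes, are all hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical node representation: (factor, entity text).
Node = Tuple[str, str]

# The four factors named in the abstract.
FACTORS = ("event", "source", "attribute", "relation")


def prf(candidate: set, reference: set) -> Dict[str, float]:
    """Precision, recall, and F1 over exactly matching set members."""
    matched = len(candidate & reference)
    precision = matched / len(candidate) if candidate else 0.0
    recall = matched / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def evaluate(candidate: Iterable[Node], reference: Iterable[Node]) -> Dict[str, Dict[str, float]]:
    """Score a candidate node set against a reference, per factor and overall."""
    cand, ref = set(candidate), set(reference)
    report = {
        factor: prf(
            {text for f, text in cand if f == factor},
            {text for f, text in ref if f == factor},
        )
        for factor in FACTORS
    }
    # Assumption: the overall score is a micro-average over all nodes.
    report["overall"] = prf(cand, ref)
    return report


# Made-up nodes for a reference like "a dog barks loudly"
# and a candidate like "a dog barks quietly".
reference_nodes = [
    ("source", "dog"),
    ("event", "bark"),
    ("attribute", "loud"),
    ("relation", "dog emits bark"),
]
candidate_nodes = [
    ("source", "dog"),
    ("event", "bark"),
    ("attribute", "quiet"),
]

report = evaluate(candidate_nodes, reference_nodes)
for name, scores in report.items():
    print(f"{name:9s} F1 = {scores['f1']:.3f}")
```

In this toy case the per-factor report shows that the candidate gets the source and event right but misses the attribute and the relation, which is the kind of breakdown a single overall score would hide.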

Anthology ID: 2024.findings-acl.729
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 12273–12287
URL: https://aclanthology.org/2024.findings-acl.729
DOI: 10.18653/v1/2024.findings-acl.729
Cite (ACL): Qian Wang, Jia-Chen Gu, and Zhen-Hua Ling. 2024. X-ACE: Explainable and Multi-factor Audio Captioning Evaluation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12273–12287, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): X-ACE: Explainable and Multi-factor Audio Captioning Evaluation (Wang et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.729.pdf
