Exploring Fine-Grained Human Motion Video Captioning

Bingchan Zhao, Xinyi Liu, Zhuocheng Yu, Tongchen Yang, Yifan Song, Mingyu Jin, Sujian Li, Yizhou Wang


Abstract
Detailed descriptions of human motion are crucial for effective fitness training, which highlights the importance of research in fine-grained human motion video captioning. Existing video captioning models often fail to capture the nuanced semantics of videos, so the generated descriptions are coarse and lack detail, especially when depicting human motions. To benchmark the Body Fitness Training scenario, we construct a fine-grained human motion video captioning dataset named BoFiT and design a state-of-the-art baseline model named BoFiT-Gen (Body Fitness Training Text Generation). BoFiT-Gen uses computer vision techniques to extract angular representations of human motions from videos, and prompts LLMs to generate fine-grained descriptions of those motions. Results show that BoFiT-Gen outperforms previous methods on comprehensive metrics. We aim for this dataset to serve as a useful evaluation set for visio-linguistic models and to drive further progress in this field. Our dataset is released at https://github.com/colmon46/bofit.
Anthology ID:
2025.coling-main.351
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
5247–5264
URL:
https://aclanthology.org/2025.coling-main.351/
Cite (ACL):
Bingchan Zhao, Xinyi Liu, Zhuocheng Yu, Tongchen Yang, Yifan Song, Mingyu Jin, Sujian Li, and Yizhou Wang. 2025. Exploring Fine-Grained Human Motion Video Captioning. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5247–5264, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Exploring Fine-Grained Human Motion Video Captioning (Zhao et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.351.pdf