2025
Exploring Fine-Grained Human Motion Video Captioning
Bingchan Zhao | Xinyi Liu | Zhuocheng Yu | Tongchen Yang | Yifan Song | Mingyu Jin | Sujian Li | Yizhou Wang
Proceedings of the 31st International Conference on Computational Linguistics
Detailed descriptions of human motion are crucial for effective fitness training, which highlights the importance of research in fine-grained human motion video captioning. Existing video captioning models often fail to capture the nuanced semantics of videos, producing descriptions that are coarse and lack detail, especially when depicting human motions. To benchmark the Body Fitness Training scenario, we construct in this paper a fine-grained human motion video captioning dataset named BoFiT and design a state-of-the-art baseline model named BoFiT-Gen (Body Fitness Training Text Generation). BoFiT-Gen uses computer vision techniques to extract angular representations of human motions from videos, and LLMs to generate fine-grained descriptions of these motions via prompting. Results show that BoFiT-Gen outperforms previous methods on comprehensive metrics. We aim for this dataset to serve as a useful evaluation set for visio-linguistic models and to drive further progress in this field. Our dataset is released at https://github.com/colmon46/bofit.
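The abstract mentions extracting "angular representations" of human motion; the paper's own extraction code is in the linked repository, not reproduced here. As a minimal sketch of what such a representation could look like, the snippet below computes the angle at a joint from three 2D pose keypoints (the function name and the shoulder/elbow/wrist example are illustrative assumptions, not the authors' implementation).

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by keypoints a-b-c,
    e.g. shoulder-elbow-wrist for the elbow angle.
    Keypoints are (x, y) tuples, as produced by a 2D pose estimator."""
    ab = (a[0] - b[0], a[1] - b[1])  # vector from joint to first keypoint
    cb = (c[0] - b[0], c[1] - b[1])  # vector from joint to second keypoint
    dot = ab[0] * cb[0] + ab[1] * cb[1]
    norm = math.hypot(*ab) * math.hypot(*cb)
    # clamp to [-1, 1] to guard against floating-point drift before acos
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

# A fully extended arm (collinear keypoints) gives ~180 degrees,
# a right-angle bend gives 90 degrees.
print(round(joint_angle((0, 0), (1, 0), (2, 0))))  # 180
print(round(joint_angle((0, 1), (0, 0), (1, 0))))  # 90
```

A sequence of such per-frame angles (one per tracked joint) is one plausible angular representation that could then be serialized into an LLM prompt.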