@inproceedings{tang-jacob-2025-language,
  title     = {Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity},
  author    = {Tang, Ming Ze and
               Jacob, Jubal Chandy},
  editor    = {Shukla, Ankita and
               Kumar, Sandeep and
               Bedi, Amrit Singh and
               Chakraborty, Tanmoy},
  booktitle = {Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact ({MMLoSo} 2025)},
  month     = dec,
  year      = {2025},
  address   = {Mumbai, India},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.mmloso-1.5/},
  pages     = {48--57},
  isbn      = {979-8-89176-311-1},
  abstract  = {This paper investigates how the specificity of natural language prompts influences zero-shot classification performance in modern vision language models (VLMs) under severe data scarcity. Using a curated 285 image subset of MS COCO containing three everyday postures (sitting, standing, and walking/running), we evaluate OpenCLIP, MetaCLIP2, and SigLIP alongside unimodal and pose-based baselines. We introduce a three tier prompt design, minimal labels, action cues, and compact geometric descriptions and systematically vary only the linguistic detail. Our results reveal a counterintuitive trend where simpler prompts consistently outperform more detailed ones, a phenomenon we term prompt overfitting. Grad-CAM attribution further shows that prompt specificity shifts attention between contextual and pose-relevant regions, explaining the model dependent behaviour. The study provides a controlled analysis of prompt granularity in low resource image based posture recognition, highlights the need for careful prompt design when labels are scarce.},
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="tang-jacob-2025-language">
<titleInfo>
<title>Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity</title>
</titleInfo>
<name type="personal">
<namePart type="given">Ming</namePart>
<namePart type="given">Ze</namePart>
<namePart type="family">Tang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jubal</namePart>
<namePart type="given">Chandy</namePart>
<namePart type="family">Jacob</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-12</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Ankita</namePart>
<namePart type="family">Shukla</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sandeep</namePart>
<namePart type="family">Kumar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Amrit</namePart>
<namePart type="given">Singh</namePart>
<namePart type="family">Bedi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Mumbai, India</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-311-1</identifier>
</relatedItem>
<abstract>This paper investigates how the specificity of natural language prompts influences zero-shot classification performance in modern vision language models (VLMs) under severe data scarcity. Using a curated 285 image subset of MS COCO containing three everyday postures (sitting, standing, and walking/running), we evaluate OpenCLIP, MetaCLIP2, and SigLIP alongside unimodal and pose-based baselines. We introduce a three tier prompt design, minimal labels, action cues, and compact geometric descriptions and systematically vary only the linguistic detail. Our results reveal a counterintuitive trend where simpler prompts consistently outperform more detailed ones, a phenomenon we term prompt overfitting. Grad-CAM attribution further shows that prompt specificity shifts attention between contextual and pose-relevant regions, explaining the model dependent behaviour. The study provides a controlled analysis of prompt granularity in low resource image based posture recognition, highlights the need for careful prompt design when labels are scarce.</abstract>
<identifier type="citekey">tang-jacob-2025-language</identifier>
<location>
<url>https://aclanthology.org/2025.mmloso-1.5/</url>
</location>
<part>
<date>2025-12</date>
<extent unit="page">
<start>48</start>
<end>57</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity
%A Tang, Ming Ze
%A Jacob, Jubal Chandy
%Y Shukla, Ankita
%Y Kumar, Sandeep
%Y Bedi, Amrit Singh
%Y Chakraborty, Tanmoy
%S Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025)
%D 2025
%8 December
%I Association for Computational Linguistics
%C Mumbai, India
%@ 979-8-89176-311-1
%F tang-jacob-2025-language
%X This paper investigates how the specificity of natural language prompts influences zero-shot classification performance in modern vision language models (VLMs) under severe data scarcity. Using a curated 285 image subset of MS COCO containing three everyday postures (sitting, standing, and walking/running), we evaluate OpenCLIP, MetaCLIP2, and SigLIP alongside unimodal and pose-based baselines. We introduce a three tier prompt design, minimal labels, action cues, and compact geometric descriptions and systematically vary only the linguistic detail. Our results reveal a counterintuitive trend where simpler prompts consistently outperform more detailed ones, a phenomenon we term prompt overfitting. Grad-CAM attribution further shows that prompt specificity shifts attention between contextual and pose-relevant regions, explaining the model dependent behaviour. The study provides a controlled analysis of prompt granularity in low resource image based posture recognition, highlights the need for careful prompt design when labels are scarce.
%U https://aclanthology.org/2025.mmloso-1.5/
%P 48-57
Markdown (Informal)
[Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity](https://aclanthology.org/2025.mmloso-1.5/) (Tang & Jacob, MMLoSo 2025)
ACL