How Well Do Vision Models Encode Diagram Attributes?

Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui

Abstract
Research on understanding and generating diagrams has used vision models such as CLIP. However, it remains unclear whether these models accurately identify diagram attributes, such as node colors and shapes, as well as edge colors and connection patterns. This study evaluates how well vision models recognize these diagram attributes by probing the models and by retrieving diagrams from text queries. Experimental results showed that while vision models can recognize differences in node colors, shapes, and edge colors, they struggle to identify differences in edge connection patterns, which play a pivotal role in the semantics of diagrams. Moreover, we revealed inadequate alignment between diagram attributes and their language representations in the embedding space.
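
The text-query retrieval evaluation described in the abstract can be illustrated with a minimal sketch: embed diagram images and attribute-describing queries with a pretrained CLIP model, then rank diagrams by cosine similarity. This is not the authors' evaluation code; the checkpoint name, file paths, and queries below are illustrative assumptions.

```python
# Minimal sketch of a CLIP-based text-to-diagram retrieval probe.
# Not the authors' code: the checkpoint, file names, and queries are
# illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical diagram renderings that differ in a single attribute.
image_paths = ["diagram_red_node.png", "diagram_blue_node.png"]
images = [Image.open(p).convert("RGB") for p in image_paths]

# Queries targeting a node-color attribute vs. an edge connection pattern.
queries = [
    "a diagram with a red node",
    "a diagram where node A is connected to node B",
]

inputs = processor(text=queries, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# L2-normalize the projected embeddings and score every (query, image) pair.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
sim = txt @ img.T  # shape: (num_queries, num_images)

# A model that encodes an attribute should rank the matching diagram first.
for query, row in zip(queries, sim):
    print(f"{query!r} -> {image_paths[int(row.argmax())]}")
```

Under a setup like this, the paper's finding would surface as high retrieval accuracy for node-color, node-shape, and edge-color queries, but near-chance accuracy for connection-pattern queries.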
Anthology ID: 2024.acl-srw.47
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Xiyan Fu, Eve Fleisig
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 564–575
URL: https://aclanthology.org/2024.acl-srw.47
Cite (ACL): Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, and Kentaro Inui. 2024. How Well Do Vision Models Encode Diagram Attributes?. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 564–575, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): How Well Do Vision Models Encode Diagram Attributes? (Yoshida et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-srw.47.pdf