Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation Enhancement

Yuxuan Wang; Xiaoyuan Liu

doi:10.18653/v1/2024.emnlp-main.97

Abstract

Scene Graph Generation (SGG) provides a basic language representation of visual scenes, requiring models to grasp complex and diverse semantics between objects. This complexity and diversity in SGG lead to underrepresentation, where some triplet labels are rare or even unseen during training, resulting in imprecise predictions. To tackle this, we propose integrating pretrained Vision-Language Models (VLMs) to enhance representation. However, due to the gap between pretraining and SGG, direct inference of pretrained VLMs on SGG leads to severe bias, which stems from the imbalanced predicate distribution in the pretraining language set. To alleviate the bias, we introduce a novel LM Estimation to approximate the unattainable predicate distribution. Finally, we ensemble the debiased VLMs with SGG models to enhance the representation, where we design a certainty-aware indicator to score each sample and dynamically adjust the ensemble weights. Our training-free method effectively addresses the predicate bias in pretrained VLMs, enhances SGG’s representation, and significantly improves performance.
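The abstract does not give the formulas for LM Estimation or for the certainty-aware indicator, so the following is only a rough illustration of the general idea under stated assumptions, not the authors' method. It assumes an estimate of the predicate prior is already available (the estimation step itself is not shown) and removes the prior bias with standard logit adjustment, i.e. subtracting the log-prior from the VLM scores. The certainty score (maximum class probability) and the proportional weighting rule are hypothetical stand-ins for the paper's indicator, and all function names (`debias_logits`, `certainty`, `certainty_weighted_ensemble`) are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one sample's predicate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def debias_logits(vlm_logits: Sequence[float], prior: Sequence[float],
                  eps: float = 1e-12) -> List[float]:
    """Logit adjustment (an assumed stand-in): subtract log of the estimated predicate prior."""
    return [z - math.log(p + eps) for z, p in zip(vlm_logits, prior)]


def certainty(probs: Sequence[float]) -> float:
    """Hypothetical certainty score: the largest class probability."""
    return max(probs)


def certainty_weighted_ensemble(sgg_logits: Sequence[float],
                                vlm_logits: Sequence[float],
                                prior: Sequence[float]) -> List[float]:
    """Blend SGG and debiased-VLM distributions, weighting each by its certainty."""
    p_sgg = softmax(sgg_logits)
    p_vlm = softmax(debias_logits(vlm_logits, prior))
    c_sgg, c_vlm = certainty(p_sgg), certainty(p_vlm)
    w_vlm = c_vlm / (c_sgg + c_vlm)  # per-sample weight on the VLM branch
    return [(1.0 - w_vlm) * a + w_vlm * b for a, b in zip(p_sgg, p_vlm)]


if __name__ == "__main__":
    # Toy numbers for illustration only: predicate 0 dominates the assumed prior.
    prior = [0.7, 0.2, 0.1]
    vlm_logits = [2.0, 1.0, 1.0]
    sgg_logits = [0.5, 0.4, 1.2]
    print(softmax(vlm_logits))                    # biased toward predicate 0
    print(softmax(debias_logits(vlm_logits, prior)))  # rare predicate 2 now ranks first
    print(certainty_weighted_ensemble(sgg_logits, vlm_logits, prior))
```

With the toy prior above, the raw VLM scores favor the frequent predicate 0, while the log-prior correction shifts the top prediction to the rare predicate 2. A uniform prior shifts every logit by the same constant and therefore leaves the distribution unchanged, which is why this kind of correction only acts on the imbalance itself.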

Anthology ID: 2024.emnlp-main.97
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 1627–1639
URL: https://aclanthology.org/2024.emnlp-main.97/
DOI: 10.18653/v1/2024.emnlp-main.97
Cite (ACL): Yuxuan Wang and Xiaoyuan Liu. 2024. Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation Enhancement. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1627–1639, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation Enhancement (Wang & Liu, EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.97.pdf