Referring Image Segmentation via Joint Mask Contextual Embedding Learning and Progressive Alignment Network

Ziling Huang; Shin’ichi Satoh

doi:10.18653/v1/2023.emnlp-main.481

Abstract

Referring image segmentation is a task that aims to predict pixel-wise masks corresponding to objects in an image described by natural language expressions. Previous methods for referring image segmentation employ a cascade framework to break down complex problems into multiple stages. However, this framework has clear drawbacks: existing methods within it may struggle both to maintain a strong focus on the most relevant information during specific stages of the referring image segmentation process and to rectify errors propagated from early stages, which can ultimately result in sub-optimal performance. To address these limitations, we propose the Joint Mask Contextual Embedding Learning Network (JMCELN). JMCELN is designed to enhance the cascade framework by incorporating a Learnable Contextual Embedding and a Progressive Alignment Network (PAN). The Learnable Contextual Embedding module dynamically stores and utilizes reasoning information based on the current mask prediction results, enabling the network to adaptively capture and refine pertinent information for improved mask prediction accuracy. Furthermore, the Progressive Alignment Network (PAN) is introduced as an integral part of JMCELN. PAN leverages the output from the previous layer as a filter for the current output, effectively reducing inconsistencies between predictions from different stages. By iteratively aligning the predictions, PAN guides the Learnable Contextual Embedding to incorporate more discriminative information for reasoning, leading to enhanced prediction quality and a reduction in error propagation. With these methods, we achieve state-of-the-art results on three commonly used benchmarks, especially on the more intricate datasets. The code will be released.
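
The abstract's description of PAN, in which the previous stage's output acts as a filter on the current stage's output, can be illustrated with a toy sketch. This is not the authors' implementation: the function name `progressive_filter`, the use of per-pixel sigmoid probabilities, and the choice of elementwise multiplication as the "filter" are all assumptions made for illustration, and the paper's actual alignment mechanism may differ.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def progressive_filter(stage_logits):
    """Hypothetical cascade filtering (an assumption, not the paper's PAN).

    stage_logits: list of stages, each a flat list of per-pixel mask logits.
    Returns one list of per-pixel probabilities per stage, where every stage
    after the first is gated by the filtered output of the stage before it.
    """
    filtered = []
    prev = None
    for logits in stage_logits:
        probs = [sigmoid(v) for v in logits]
        if prev is not None:
            # Assumed "filter": a pixel the previous stage rejected is
            # suppressed in the current stage via elementwise multiplication.
            probs = [p * q for p, q in zip(probs, prev)]
        filtered.append(probs)
        prev = probs
    return filtered


# Two stages over two pixels: stage 1 rejects the second pixel,
# so stage 2's confident prediction for that pixel is suppressed.
stages = [[10.0, -10.0], [10.0, 10.0]]
masks = progressive_filter(stages)
```

Under this assumed gating, predictions can only become more conservative from one stage to the next, which is one simple way to keep successive stages consistent with each other.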

Anthology ID: 2023.emnlp-main.481
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 7753–7762
URL: https://aclanthology.org/2023.emnlp-main.481/
DOI: 10.18653/v1/2023.emnlp-main.481
Cite (ACL): Ziling Huang and Shin’ichi Satoh. 2023. Referring Image Segmentation via Joint Mask Contextual Embedding Learning and Progressive Alignment Network. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7753–7762, Singapore. Association for Computational Linguistics.
Cite (Informal): Referring Image Segmentation via Joint Mask Contextual Embedding Learning and Progressive Alignment Network (Huang & Satoh, EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-main.481.pdf
Video: https://aclanthology.org/2023.emnlp-main.481.mp4
