Unimodal and Crossmodal Refinement Network for Multimodal Sequence Fusion

Xiaobao Guo, Adams Kong, Huan Zhou, Xianfeng Wang, Min Wang


Abstract
Effective unimodal representation and complementary crossmodal representation fusion are both important in multimodal representation learning. Prior works often modulate one modal feature to another straightforwardly and thus underutilize both unimodal and crossmodal representation refinements, which creates a bottleneck for performance improvement. In this paper, the Unimodal and Crossmodal Refinement Network (UCRN) is proposed to enhance both unimodal and crossmodal representations. Specifically, to improve unimodal representations, a unimodal refinement module is designed to refine modality-specific learning by iteratively updating the distribution with transformer-based attention layers. Self-quality improvement layers follow to progressively generate the desired weighted representations. Subsequently, those unimodal representations are projected into a common latent space, regularized by a multimodal Jensen-Shannon divergence loss for better crossmodal refinement. Lastly, a crossmodal refinement module is employed to integrate all information. Through hierarchical exploration of unimodal, bimodal, and trimodal interactions, UCRN is highly robust against missing modalities and noisy data. Experimental results on the MOSI and MOSEI datasets show that the proposed UCRN outperforms recent state-of-the-art techniques and that its robustness is well suited to real multimodal sequence fusion scenarios. Code will be shared publicly.
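The multimodal Jensen-Shannon divergence regularizer mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in UCRN, so this assumes the standard equal-weight generalized Jensen-Shannon divergence over three distributions, computed as the entropy of the mixture minus the mean of the individual entropies. The function names, the three modality labels, and the example logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def multimodal_jsd(distributions):
    """Equal-weight generalized Jensen-Shannon divergence.

    JSD(P_1, ..., P_n) = H(mean_i P_i) - mean_i H(P_i)

    Assumption: this is the textbook definition, not necessarily the
    loss implemented in UCRN.
    """
    n = len(distributions)
    dim = len(distributions[0])
    if any(len(p) != dim for p in distributions):
        raise ValueError("all distributions must have the same length")
    # Mixture distribution: element-wise average of the inputs.
    mixture = [sum(p[k] for p in distributions) / n for k in range(dim)]
    return entropy(mixture) - sum(entropy(p) for p in distributions) / n


# Hypothetical latent scores for three modalities projected to one space.
text_dist = softmax([2.0, 0.5, -1.0])
audio_dist = softmax([1.5, 0.7, -0.5])
vision_dist = softmax([0.2, 1.8, 0.1])

loss = multimodal_jsd([text_dist, audio_dist, vision_dist])
```

Under this definition the value is zero when all three distributions coincide and reaches its maximum of log 3 when they have disjoint support, so minimizing it pulls the projected unimodal representations toward agreement in the common latent space.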
Anthology ID:
2021.emnlp-main.720
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
9143–9153
URL:
https://aclanthology.org/2021.emnlp-main.720
DOI:
10.18653/v1/2021.emnlp-main.720
Cite (ACL):
Xiaobao Guo, Adams Kong, Huan Zhou, Xianfeng Wang, and Min Wang. 2021. Unimodal and Crossmodal Refinement Network for Multimodal Sequence Fusion. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9143–9153, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Unimodal and Crossmodal Refinement Network for Multimodal Sequence Fusion (Guo et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.720.pdf
Video:
https://aclanthology.org/2021.emnlp-main.720.mp4
Data:
CMU-MOSEI, Multimodal Opinionlevel Sentiment Intensity