2024
Uncertainty-Aware Cross-Modal Alignment for Hate Speech Detection
Chuanpeng Yang | Fuqing Zhu | Yaxin Liu | Jizhong Han | Songlin Hu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Hate speech detection has become an urgent task with the emergence of a huge volume of multimodal harmful content (e.g., memes) on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from memes. However, these methods ignore two key points: 1) the misalignment of image and text in memes caused by the modality gap, and 2) the uncertainty between modalities caused by the contribution degree of each modality to hate sentiment. To this end, this paper proposes an uncertainty-aware cross-modal alignment (UCA) framework for modeling the misalignment and uncertainty in multimodal hate speech detection. Specifically, we first utilize the cross-modal feature encoder to capture image and text feature representations in memes. Then, a cross-modal alignment module is applied to reduce semantic gaps between modalities by aligning the feature representations. Next, a cross-modal fusion module is designed to learn semantic interactions between modalities to capture cross-modal correlations, providing complementary features for memes. Finally, a cross-modal uncertainty learning module is proposed, which evaluates the divergence between unimodal feature distributions to balance unimodal and cross-modal fusion features. Extensive experiments on five publicly available datasets show that the proposed UCA achieves competitive performance compared with existing multimodal hate speech detection methods.
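The abstract does not say which divergence the uncertainty learning module uses or how the balance is computed. As a rough illustration only, the sketch below assumes each unimodal feature is summarized by a diagonal Gaussian, measures a symmetric KL divergence between the image and text distributions, and maps it to a mixing weight. The names `kl_diag_gaussian`, `fusion_weight`, and `balance`, and the choice to trust fused features more when the modalities agree, are all hypothetical and not taken from the paper.

```python
import math


def kl_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for diagonal Gaussians given per-dimension means and variances."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def fusion_weight(mu_img, var_img, mu_txt, var_txt):
    """Hypothetical gate in (0, 1]: equals 1 when the two unimodal
    distributions coincide and decays as their symmetric KL grows."""
    sym_kl = (kl_diag_gaussian(mu_img, var_img, mu_txt, var_txt)
              + kl_diag_gaussian(mu_txt, var_txt, mu_img, var_img))
    return math.exp(-sym_kl)


def balance(unimodal, fused, weight):
    """Convex combination of unimodal and cross-modal fusion features (assumed form)."""
    return [weight * f + (1.0 - weight) * u for u, f in zip(unimodal, fused)]


if __name__ == "__main__":
    w = fusion_weight([0.0, 0.0], [1.0, 1.0], [0.5, -0.5], [1.0, 2.0])
    print(balance([1.0, 1.0], [3.0, 3.0], w))
```

Under this assumed gate, a meme whose image and text distributions diverge strongly falls back on unimodal features, while a well-aligned meme relies on the fused representation.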
Uncertainty-Guided Modal Rebalance for Hateful Memes Detection
Chuanpeng Yang | Yaxin Liu | Fuqing Zhu | Jizhong Han | Songlin Hu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hateful memes detection is a challenging multimodal understanding task that requires comprehensive learning of vision, language, and cross-modal interactions. Previous research has focused on developing effective fusion strategies for integrating hate information from different modalities. However, these methods rely excessively on cross-modal fusion features, ignoring the modality uncertainty caused by the contribution degree of each modality to hate sentiment and the modality imbalance caused by the dominant modality suppressing the optimization of the other modality. To this end, this paper proposes an Uncertainty-guided Modal Rebalance (UMR) framework for hateful memes detection. The uncertainty of each meme is explicitly formulated by designing a stochastic representation drawn from a Gaussian distribution for adaptively aggregating cross-modal features with unimodal features. The modality imbalance is alleviated by improving the cosine loss from the perspectives of inter-modal feature and weight vector constraints. In this way, the suppressed unimodal representation ability in multimodal models is unleashed, while the learning of modality contribution is further promoted. Extensive experimental results demonstrate that the proposed UMR achieves state-of-the-art performance on four widely used datasets.
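The abstract gives no formula for the stochastic representation. A minimal sketch of one common way to draw from a learned Gaussian and use the draw as an adaptive mixing gate is shown below. It assumes the standard reparameterization z = mu + sigma * eps and a sigmoid gate; `sample_gaussian`, `aggregate`, and the gating form are hypothetical illustrations, not the UMR implementation.

```python
import math
import random


def sample_gaussian(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def aggregate(unimodal, cross_modal, mu, log_var, rng):
    """Mix cross-modal and unimodal features with a per-dimension gate
    obtained from a Gaussian sample (assumed form, for illustration)."""
    gates = [sigmoid(z) for z in sample_gaussian(mu, log_var, rng)]
    return [g * c + (1.0 - g) * u
            for g, c, u in zip(gates, cross_modal, unimodal)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the draw is repeatable
    print(aggregate([1.0, 1.0], [2.0, 2.0], [0.0, 1.0], [-2.0, -2.0], rng))
```

In this sketch a larger `log_var` makes the gate noisier, which is one way a model could express that it is unsure how much each modality contributes.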
2023
QAP: A Quantum-Inspired Adaptive-Priority-Learning Model for Multimodal Emotion Recognition
Ziming Li | Yan Zhou | Yaxin Liu | Fuqing Zhu | Chuanpeng Yang | Songlin Hu
Findings of the Association for Computational Linguistics: ACL 2023
Multimodal emotion recognition for video has gained considerable attention in recent years, in which three modalities (i.e., textual, visual and acoustic) are involved. Because they carry different amounts of emotion-related information, the three modalities typically contribute to emotion recognition to varying degrees. More seriously, there might be inconsistencies between the emotion of an individual modality and that of the video. The challenges mentioned above are caused by the inherent uncertainty of emotion. Inspired by the recent advances of quantum theory in modeling uncertainty, we make an initial attempt to design a quantum-inspired adaptive-priority-learning model (QAP) to address the challenges. Specifically, the quantum state is introduced to model modal features, which allows each modality to retain all emotional tendencies until the final classification. Additionally, we design Q-attention to integrate the three modalities in order, and then QAP learns modal priority adaptively so that modalities can provide different amounts of information based on priority. Experimental results on the IEMOCAP and MOSEI datasets show that QAP establishes new state-of-the-art results.
2022
AMOA: Global Acoustic Feature Enhanced Modal-Order-Aware Network for Multimodal Sentiment Analysis
Ziming Li | Yan Zhou | Weibo Zhang | Yaxin Liu | Chuanpeng Yang | Zheng Lian | Songlin Hu
Proceedings of the 29th International Conference on Computational Linguistics
In recent years, multimodal sentiment analysis (MSA), which aims to predict the sentiment polarity expressed in a video, has attracted increasing interest. Existing methods typically 1) treat three modal features (textual, acoustic, visual) equally, without distinguishing the importance of different modalities; and 2) split the video into frames, which loses the global acoustic information. In this paper, we propose a global Acoustic feature enhanced Modal-Order-Aware network (AMOA) to address these problems. Firstly, a modal-order-aware network is designed to obtain the multimodal fusion feature. This network integrates the three modalities in a certain order, which makes the modality at the core position matter more. Then, we introduce the global acoustic feature of the whole video into our model. Since the global acoustic feature and multimodal fusion feature originally reside in their own spaces, contrastive learning is further employed to align them before concatenation. Experiments on two public datasets show that our model outperforms the state-of-the-art models. In addition, we generalize our model to sentiment tasks with more complex semantics, such as sarcasm detection, where it also achieves state-of-the-art performance on a widely used sarcasm dataset.
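The abstract states that contrastive learning aligns the global acoustic feature with the multimodal fusion feature but does not name the loss. As an assumption, the sketch below uses an InfoNCE-style objective over a batch: the acoustic and fusion features of the same video form the positive pair, and fusion features of the other videos act as negatives. The function names and the temperature value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(acoustic, fusion, temperature=0.1):
    """Mean InfoNCE loss over a batch; acoustic[i] and fusion[i] are assumed
    to come from the same video (positive pair), all other pairs are negatives."""
    losses = []
    for i, a in enumerate(acoustic):
        logits = [cosine(a, f) / temperature for f in fusion]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    acoustic = [[1.0, 0.0], [0.0, 1.0]]
    fusion = [[0.9, 0.1], [0.2, 0.8]]
    print(info_nce(acoustic, fusion))
```

Minimizing a loss of this kind pulls each video's two representations together in a shared space, which is the stated purpose of aligning them before concatenation.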