Multimodal sarcasm detection (MSD) is essential for various downstream tasks. Existing MSD methods tend to rely on spurious correlations. These methods often mistakenly prioritize non-essential features yet still make correct predictions, demonstrating poor generalizability beyond training environments. Regarding this phenomenon, this paper undertakes several initiatives. Firstly, we identify two primary causes that lead to the reliance of spurious correlations. Secondly, we address these challenges by proposing a novel method that integrate Multimodal Incongruities via Contrastive Learning (MICL) for multimodal sarcasm detection. Specifically, we first leverage incongruity to drive multi-view learning from three views: token-patch, entity-object, and sentiment. Then, we introduce extensive data augmentation to mitigate the biased learning of the textual modality. Additionally, we construct a test set, SPMSD, which consists potential spurious correlations to evaluate the the model’s generalizability. Experimental results demonstrate the superiority of MICL on benchmark datasets, along with the analyses showcasing MICL’s advancement in mitigating the effect of spurious correlation.
Video Question Answering (VideoQA) aims to answer questions about the visual content of a video. Current methods mainly focus on improving joint representations of video and text. However, these methods pay little attention to the fine-grained semantic interaction between video and text. In this paper, we propose Mulan: a Multi-Level Alignment Model for Video Question Answering, which establishes alignment between visual and textual modalities at the object-level, frame-level, and video-level. Specifically, for object-level alignment, we propose a mask-guided visual feature encoding method and a visual-guided text description method to learn fine-grained spatial information. For frame-level alignment, we introduce the use of visual features from individual frames, combined with a caption generator, to learn overall spatial information within the scene. For video-level alignment, we propose an expandable ordinal prompt for textual descriptions, combined with visual features, to learn temporal information. Experimental results show that our method outperforms the state-of-the-art methods, even when utilizing the smallest amount of extra visual-language pre-training data and a reduced number of trainable parameters.
For sentence-level extractive summarization, there is a disproportionate ratio of selected and unselected sentences, leading to flatting the summary features when maximizing the accuracy. The imbalanced classification of summarization is inherent, which can’t be addressed by common algorithms easily. In this paper, we conceptualize the single-document extractive summarization as a rebalance problem and present a deep differential amplifier framework. Specifically, we first calculate and amplify the semantic difference between each sentence and all other sentences, and then apply the residual unit as the second item of the differential amplifier to deepen the architecture. Finally, to compensate for the imbalance, the corresponding objective loss of minority class is boosted by a weighted cross-entropy. In contrast to previous approaches, this model pays more attention to the pivotal information of one sentence, instead of all the informative context modeling by recurrent or Transformer architecture. We demonstrate experimentally on two benchmark datasets that our summarizer performs competitively against state-of-the-art methods. Our source code will be available on Github.