AbstractIn recent years, multimodal sentiment analysis (MSA) has attracted more and more interest, which aims to predict the sentiment polarity expressed in a video. Existing methods typically 1) treat three modal features (textual, acoustic, visual) equally, without distinguishing the importance of different modalities; and 2) split the video into frames, leading to missing the global acoustic information. In this paper, we propose a global Acoustic feature enhanced Modal-Order-Aware network (AMOA) to address these problems. Firstly, a modal-order-aware network is designed to obtain the multimodal fusion feature. This network integrates the three modalities in a certain order, which makes the modality at the core position matter more. Then, we introduce the global acoustic feature of the whole video into our model. Since the global acoustic feature and multimodal fusion feature originally reside in their own spaces, contrastive learning is further employed to align them before concatenation. Experiments on two public datasets show that our model outperforms the state-of-the-art models. In addition, we also generalize our model to the sentiment with more complex semantics, such as sarcasm detection. Our model also achieves state-of-the-art performance on a widely used sarcasm dataset.