AbstractMultimodal sentiment analysis aims to predict sentiment of language text with the help of other modalities, such as vision and acoustic features. Previous studies focused on learning the joint representation of multiple modalities, ignoring some useful knowledge contained in language modal. In this paper, we try to incorporate sentimental words knowledge into the fusion network to guide the learning of joint representation of multimodal features. Our method consists of two components: shallow fusion part and aggregation part. For the shallow fusion part, we use crossmodal coattention mechanism to obtain bidirectional context information of each two modals to get the fused shallow representations. For the aggregation part, we design a multitask of sentimental words classification to help and guide the deep fusion of the three modalities and obtain the final sentimental words aware fusion representation. We carry out several experiments on CMU-MOSI, CMU-MOSEI and YouTube datasets. The experimental results show that introducing sentimental words prediction as a multitask can really improve the fusion representation of multiple modalities.