ConKI: Contrastive Knowledge Injection for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis leverages multimodal signals to detect the sentiment of a speaker. Previous approaches concentrate on performing multimodal fusion and representation learning based on general knowledge obtained from pretrained models, which neglects the effect of domain-specific knowledge. In this paper, we propose Contrastive Knowledge Injection (ConKI) for multimodal sentiment analysis, where specific-knowledge representations for each modality can be learned together with general knowledge representations via knowledge injection based on an adapter architecture. In addition, ConKI uses a hierarchical contrastive learning procedure performed between knowledge types within every single modality, across modalities within each sample, and across samples to facilitate the effective learning of the proposed representations, hence improving multimodal sentiment predictions. The experiments on three popular multimodal sentiment analysis benchmarks show that ConKI outperforms all prior methods on a variety of performance metrics.


Introduction
Multimodal sentiment analysis (MSA) is the task of mining and comprehending the sentiments of online videos, which has many downstream applications, e.g., analyzing the overall opinion of customers about a product or gauging polling intentions of voters (Han et al., 2021; Melville et al., 2009). Most existing MSA methods focus on developing fusion techniques between modalities. The simplest approach is to concatenate text, video, and audio features into a fused vector for subsequent classification or regression. An alternative is to use outer products, Recurrent Neural Networks (RNNs), or attention-based models to model multimodal interactions (Chen et al., 2017; Williams et al., 2018; Zadeh et al., 2017; Liu and Shen, 2018). More recently, MSA methods for learning effective multimodal representations have continued to emerge (Hazarika et al., 2020; Mai et al., 2021; Yu et al., 2021), ranging from decomposing the representation of each modality to introducing extra constraints in the learning objective.
Although the above methods have led to improvements in MSA performance, they focus on utilizing general knowledge obtained from pretrained models to encode modalities, which is inadequate for identifying specific sentiments across modalities. One way to address this issue is knowledge injection, which can generate specific knowledge to complement the general knowledge and further improve predictions. Many researchers have discovered that injecting knowledge from other sources, such as linguistic knowledge, encyclopedic knowledge, and domain-specific knowledge, can enhance existing pretrained language models in terms of knowledge awareness and lead to improved performance on various downstream tasks (Wei et al., 2021; Lauscher et al., 2020; Wang et al., 2021a).
In this paper, we propose ConKI, a Contrastive Knowledge Injection framework, to learn both pan-knowledge representations and knowledge-specific representations to boost MSA performance. We argue that a unimodal representation can consist of a pan-knowledge representation (given by a pretrained model like BERT (Devlin et al., 2019)) and a knowledge-specific representation (injected from relevant external sources). Specifically, ConKI uses a pretrained BERT model to extract textual pan-knowledge representations and uses two randomly initialized transformer encoders to generate acoustic and visual pan-knowledge representations, respectively. In the meantime, it applies a knowledge injection model, called an adapter, to each modality to yield knowledge-specific representations. Both pan- and specific-knowledge representations are fused first within each modality and then across modalities, before the fused features are used for sentiment prediction. We further propose a hierarchical contrastive learning procedure performed between knowledge types within every single modality, across modalities within each sample, and across samples, to facilitate the learning of these representations in ConKI.
The main contributions of this work can be summarized as follows:
• We propose ConKI, a Contrastive Knowledge Injection framework for multimodal sentiment analysis. ConKI aims to boost model performance through external knowledge injection from other datasets and hierarchical contrastive learning, which we show to be better than simply fine-tuning with external datasets.
• We propose hierarchical contrastive learning that uses a unified contrastive loss to disentangle the pan-knowledge representations from the specific-knowledge representations since they belong to different knowledge domains and should complement each other.
• We conduct extensive experiments on three popular benchmark MSA datasets and attain results that are superior to the existing state-of-the-art MSA baselines on all metrics, demonstrating the effectiveness of the proposed methods in ConKI.

Related Work
In this section, we discuss related research in multimodal sentiment analysis, knowledge injection, and contrastive learning.

Multimodal Sentiment Analysis
Research on MSA mainly focuses on multimodal fusion and representation learning. For multimodal fusion, existing methods are typically divided into early fusion and late fusion techniques. Early fusion refers to joining the multimodal inputs into a single feature before single-model encoding. For example, Williams et al. (2018) concatenate the initial input features and then use an LSTM to capture the temporal dependencies in the sequence. On the contrary, late fusion learns unimodal representations via separate models and fuses them at a later stage for inference. Zadeh et al. (2017) introduce a tensor fusion network that first encodes each modality with corresponding sub-networks and then models the unimodal, bimodal, and trimodal interactions with a three-fold Cartesian product. For representation learning methods, Hazarika et al. (2020) propose to project each modality into a modality-invariant and a modality-specific representation. Different from the above work, we propose to decompose each modality into two representations based on knowledge types. Both representations complement each other, leading to a richer unimodal representation.

Knowledge Injection
Injecting knowledge into pretrained language models (PLMs) has been proven to outperform vanilla pretrained models on various NLP tasks (Wei et al., 2021; Wang et al., 2021a; Tian et al., 2020; Ke et al., 2020; Lin et al., 2019; Wang et al., 2021b). Adapters are commonly used as knowledge injection models plugged outside or inside of PLMs. For instance, Wang et al. (2021a) infuse factual knowledge from Wikidata (Vrandečić and Krötzsch, 2014) and linguistic knowledge from web text into RoBERTa (Liu et al., 2019) via two kinds of adapters. In this work, we build different adapters for different modalities, not limited to text, to learn specific multimodal knowledge from an external dataset for the downstream task. To the best of our knowledge, we are the first to explore knowledge injection in the multimodal domain.

Contrastive Learning
Contrastive learning (CL) aims to learn effective representations such that positive pairs of samples are close while negative pairs of samples are far apart (Liu et al., 2021; Li et al., 2020; Chen et al., 2020a; Khosla et al., 2020; He et al., 2020). Existing works can be divided into two categories: self-supervised CL (Akbari et al., 2021; Chen et al., 2020a,b; He et al., 2020; You et al., 2020; Tao et al., 2020) and supervised CL (Khosla et al., 2020; Mai et al., 2021). The difference between them is whether label information is used to form positive/negative pairs. For example, Khosla et al. (2020) propose supervised CL to pull samples of the same class together and push samples from different classes apart. In our work, we design contrastive pairs at a finer granularity. That is, we consider contrasts between knowledge types, between modalities, and across samples.

Method
In this section, we explain the Contrastive Knowledge Injection framework (ConKI) in detail. The goal of ConKI is to generate pan- and specific-knowledge modality representations via knowledge injection and hierarchical contrastive learning. Knowledge injection intends to obtain knowledge-specific representations that complement the pan-knowledge representations offered by pretrained models. Hierarchical contrastive learning further optimizes these knowledge-specific and pan-knowledge representations by considering contrasts between knowledge types, modalities, and samples.

Problem Definition
The task of multimodal sentiment analysis (MSA) is to detect sentiments in videos based on multimodal signals, including the text (t), vision (v), and audio (a) modalities. These signals are represented as sequences of low-level features, i.e., I_t ∈ R^{l_t × d_t}, I_v ∈ R^{l_v × d_v}, and I_a ∈ R^{l_a × d_a}, respectively. Here l_m, m ∈ {t, v, a}, denotes the length of the sequence for each modality, while d_m denotes the corresponding feature vector dimension. The details for acquiring these features are described in Appendix B. Given these sequences I_m, m ∈ {t, v, a}, the primary task is to make accurate predictions of the sentiment intensity by extracting and fusing higher-level multimodal information.
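As a concrete illustration of the notation above, the following sketch builds toy input tensors. The sequence lengths and feature dimensions are assumptions for illustration, not the paper's actual values:

```python
import torch

# Illustrative shapes only; the real l_m and d_m depend on the dataset and
# the feature extractors (BERT tokenizer, COVAREP, FACS).
l_t, d_t = 50, 768    # text: 50 tokens of BERT-sized features
l_a, d_a = 375, 5     # audio: low-level acoustic descriptors
l_v, d_v = 500, 20    # vision: facial-action features

I_t = torch.randn(l_t, d_t)
I_a = torch.randn(l_a, d_a)
I_v = torch.randn(l_v, d_v)

# An MSA model maps (I_t, I_a, I_v) to a single sentiment intensity score.
assert I_t.shape == (l_t, d_t) and I_a.shape == (l_a, d_a) and I_v.shape == (l_v, d_v)
```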

Overall Architecture
Figure 1 shows the overall architecture of ConKI. We first process the raw multimodal input into low-level features I_m, m ∈ {t, v, a}, with their corresponding feature extractors and tokenizers. Then we encode I_m into knowledge-specific representations (i.e., A_m) generated by adapters and pan-knowledge representations (i.e., O_m) generated by pretrained encoders. The text encoder comes from a publicly available pretrained backbone like BERT (Devlin et al., 2019), while the vision/audio encoders are custom models with random initialization, since there is no suitable backbone pretrained on the above low-level features. After generating the knowledge-specific and pan-knowledge representations, ConKI is trained simultaneously with two tasks on the downstream target dataset: the primary MSA regression task and the contrastive learning subtask.
For the MSA task, we concatenate the knowledge-specific representation and pan-knowledge representation of each modality before feeding them into a fully-connected (FC) layer for inner-modality fusion. We then design a fusion network that consists of a concatenation layer and a fusion module for multi-modality fusion, as shown in Figure 2. The fused representations are passed into a multilayer perceptron (MLP) network to produce the sentiment predictions, ŷ.
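The inner- and multi-modality fusion path described above can be sketched as follows. Module names and dimensions are our assumptions, and the plain MLP here is a simple stand-in for the weighted fusion module of Figure 2, not the paper's exact design:

```python
import torch
import torch.nn as nn

class InnerModalityFusion(nn.Module):
    """Concatenate one modality's pan-knowledge vector O_m and
    knowledge-specific vector A_m, then project with a single FC layer."""
    def __init__(self, d_pan, d_spec, d_out):
        super().__init__()
        self.fc = nn.Linear(d_pan + d_spec, d_out)

    def forward(self, o_m, a_m):
        return self.fc(torch.cat([o_m, a_m], dim=-1))

class MultiModalityFusion(nn.Module):
    """Concatenate the three fused unimodal vectors and regress a sentiment
    score with an MLP (a plain stand-in for the weighted fusion module)."""
    def __init__(self, d, d_hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * d, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, 1))

    def forward(self, h_t, h_a, h_v):
        return self.mlp(torch.cat([h_t, h_a, h_v], dim=-1))

# Toy forward pass: batch of 2, per-modality representation size 8, fused size 16.
inner = {m: InnerModalityFusion(8, 8, 16) for m in "tav"}
fuse = MultiModalityFusion(16)
h = {m: inner[m](torch.randn(2, 8), torch.randn(2, 8)) for m in "tav"}
y_hat = fuse(h["t"], h["a"], h["v"])   # shape: (batch, 1)
```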
For the subtask of hierarchical contrastive learning, we carefully construct the negative and positive sample pairs at the knowledge level, the modality level, and the sample level. The intuition behind our pairing policy is as follows. We expect A_m and O_m to capture different knowledge, so we disentangle them and make them complement each other to obtain richer modality representations via knowledge-level contrasts. Since a video's sentiment is determined by all modalities, we learn the commonalities among the six representations via modality-level contrasts. Besides, videos that express close sentiments should share some correlations. We capture these correlations via sample-level contrasts to further learn the commonalities among samples with close sentiments. By integrating these hierarchical contrasts, ConKI is able to capture the full dynamics among representations, which significantly benefits the main MSA task.

Encoding with Knowledge Injection
We encode each modality into a pan-knowledge representation via the pretrained encoders and a knowledge-specific representation via the adapters.
Pan-knowledge representations. We use the pretrained BERT (Devlin et al., 2019) to encode the input sentence for the text modality. The pooled output vector of the last layer is extracted as the whole sentence representation O_t:

O_t, H_t = BERT(I_t; θ_t^{BERT}), (1)

where H_t denotes the hidden states of all layers.
For the audio and vision modalities, we employ encoders of stacked transformer layers (Vaswani et al., 2017) to capture the temporal features O_m:

O_m, H_m = Transformer_m(I_m; θ_m), m ∈ {a, v}. (2)

Here, O_t, O_a, and O_v are regarded as three pan-knowledge representations since they mainly contain general knowledge, such as the generic facts encoded by BERT (Devlin et al., 2019) pretrained on big text data.
Knowledge-specific representations. We infuse specific domain knowledge from external multimodal sources through knowledge injection models (adapters). The adapter is commonly used in natural language processing (NLP) to enhance existing pretrained language models' knowledge awareness (Wei et al., 2021). The outputs of the adapters are taken as knowledge-specific representations. Specifically, the adapter for each modality is plugged outside of the respective pretrained encoder, as shown in Figure 3. It consists of multiple modules with the same sandwich structure: two FC layers with two transformer layers in between. Each module can be inserted before any transformer layer of the backbone model (encoder), e.g., the second and fourth transformer layers in Figure 3.
Therefore, each module takes the intermediate layers' hidden states of the pretrained encoder and the output of the previous adapter module as input. The output of the adapter is denoted as A_m:

A_m = Adapter_m(I_m, H_m; θ_m^{Adapter}), m ∈ {t, v, a}. (3)

With the objective of learning specific multimodal sentiment knowledge, we pretrain one adapter for each modality, i.e., Adapter_t, Adapter_a, and Adapter_v, concurrently on an external dataset while keeping the pretrained encoders frozen. Since the external dataset we select is also from the multimodal sentiment domain, the pretraining task remains the MSA task. That is, we pretrain the adapter parts in Figure 1 with only the MSA task on the external dataset, then utilize the pretrained adapters to produce knowledge-specific representations A_m for the downstream target task, which includes both the MSA task and the hierarchical contrastive learning subtask. Algorithm 1 summarizes this pretraining procedure of the adapters.
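A minimal sketch of one adapter sandwich module is below. The hidden sizes, head count, and the exact way the backbone hidden states are combined with the previous module's output are assumptions for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

class AdapterModule(nn.Module):
    """One sandwich module of the adapter: an FC layer, two transformer
    encoder layers, and another FC layer. It consumes the backbone's
    intermediate hidden states together with the previous module's output
    (combined here by concatenation, an assumption)."""
    def __init__(self, d_backbone, d_adapter, n_heads=4):
        super().__init__()
        self.down = nn.Linear(d_backbone + d_adapter, d_adapter)
        layer = nn.TransformerEncoderLayer(d_model=d_adapter, nhead=n_heads,
                                           batch_first=True)
        self.sandwich = nn.TransformerEncoder(layer, num_layers=2)
        self.up = nn.Linear(d_adapter, d_adapter)

    def forward(self, backbone_hidden, prev_out):
        x = torch.cat([backbone_hidden, prev_out], dim=-1)
        return self.up(self.sandwich(self.down(x)))

# One module tapping a BERT-sized hidden state (dimensions are illustrative).
module = AdapterModule(d_backbone=768, d_adapter=64)
h = torch.randn(2, 10, 768)    # intermediate backbone hidden states
prev = torch.zeros(2, 10, 64)  # output of the previous adapter module
a = module(h, prev)            # knowledge-specific features, shape (2, 10, 64)
```

Several such modules would be chained, each one also reading a deeper backbone layer, with the last module's output taken as A_m.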

Hierarchical Contrastive Learning
In our framework, we propose a hierarchical contrastive learning method to enhance the learned representations by considering the following four aspects in a batch B:
• For a single video sample i, all the modalities share common motives of the speaker that determine the overall sentiment. The pan-knowledge representations of different modalities are expected to represent similar meanings and thus need to be pulled closer to each other, and the same applies to the knowledge-specific representations. This intuition leads to the construction of the intra-sample positive pairs:

P^i_1 = {(O^i_m, O^i_n)} ∪ {(A^i_m, A^i_n)}, m, n ∈ {t, v, a}, m ≠ n;

• The pan-knowledge representations and the knowledge-specific representations should be disentangled from each other since they belong to different knowledge domains and are designed to complement each other. This holds inside each sample (i and j represent the same sample) as well as across samples in the batch (i and j represent two different samples). Therefore, we can build the inter-knowledge negative pairs within a batch:

N_1 = {(O^i_m, A^j_n)}, m, n ∈ {t, v, a}, i, j ∈ B;

• For two arbitrary samples i and j having close sentiments, i.e., their sentiment scores can be rounded to the same integer, the six representations of sample i (i.e., O^i_m and A^i_m) should be close to the corresponding representations of sample j (i.e., O^j_n and A^j_n). Note that the subscripts m and n represent the modality for samples i and j, respectively. We then form the inter-sample positive pairs as

P^i_2 = {(O^i_m, O^j_n)} ∪ {(A^i_m, A^j_n)}, r(y^i) = r(y^j), i ≠ j,

where y^i denotes the ground truth of sample i, and r(·) stands for the round function;
• Except for the pairs derived from the above three aspects, the remaining pairs with sample i in the same batch are set as negative pairs N^i_2.
Please refer to Appendix A.1 for a more detailed pairing policy. Specifically, our hierarchical contrastive loss L_con is computed by

L_con = Σ_{i∈B} ℓ^i, where

ℓ^i = −(1/|P^i_1 ∪ P^i_2|) Σ_{(p,q)∈P^i_1∪P^i_2} log [ exp(sim(p, q)/τ) / Σ_{(p,k)∈P^i_1∪P^i_2∪N_1∪N^i_2} exp(sim(p, k)/τ) ].

In the above equation, |P^i_1 ∪ P^i_2| denotes the number of positive pairs with sample i in a batch B, (·, ·) denotes a pair in the corresponding set, e.g., (p, q), and τ is a scalar temperature parameter. The
rationale behind this hierarchical contrastive learning subtask is as follows. First, we capture the commonalities across the three modalities within each knowledge type of each sample to reduce the modality gaps under a shared motive. Second, we model the commonalities across samples of close sentiments within each knowledge type to reduce the sample gaps. Third, we capture the differences between the pan-knowledge representations and the knowledge-specific representations in each sample, which results in a complementary effect between the two types of knowledge representations. Last but not least, we capture the differences across samples of different sentiments within each knowledge type in order to learn the dynamics of different sentiment intervals.
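Under the pairing policy above, a simplified version of the hierarchical contrastive loss can be sketched as follows. It treats pairs with the same knowledge type and the same rounded sentiment interval as positives (which subsumes the intra-sample and inter-sample positive pairs) and everything else, including all pan-vs-specific pairs, as negatives; the cosine similarity and the exact normalization are our assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(O, A, y, tau=0.07):
    """Sketch of the hierarchical contrastive loss. O and A hold the pan- and
    specific-knowledge representations, each of shape (batch, 3, d) for the
    three modalities; y holds the sentiment scores."""
    B = O.size(0)
    # Stack to (6B, d): for each sample, 3 pan reps followed by 3 specific reps.
    Z = F.normalize(torch.cat([O, A], dim=1).reshape(6 * B, -1), dim=-1)
    interval = torch.round(y).long().repeat_interleave(6)  # sentiment interval
    ktype = torch.tensor([0, 0, 0, 1, 1, 1]).repeat(B)     # pan vs. specific
    sim = (Z @ Z.t()) / tau
    self_mask = torch.eye(6 * B, dtype=torch.bool)
    # Positives: same knowledge type AND same rounded sentiment interval.
    pos = ((ktype[:, None] == ktype[None, :])
           & (interval[:, None] == interval[None, :]) & ~self_mask)
    sim = sim.masked_fill(self_mask, float('-inf'))        # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)        # avoid -inf * 0
    # Average negative log-probability over each row's positives, then
    # average over the batch.
    per_row = -(log_prob * pos.float()).sum(1) / pos.sum(1).clamp(min=1)
    return per_row.sum() / B

loss = hierarchical_contrastive_loss(torch.randn(4, 3, 8), torch.randn(4, 3, 8),
                                     torch.tensor([1.2, 1.4, -2.0, 0.3]))
```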

Training Procedure
Given the ground truth y and the predictions ŷ, we can calculate the main MSA task loss as the mean squared error:

L_task = (1/|B|) Σ_{i∈B} (y^i − ŷ^i)^2,

where |B| is the number of samples in a batch. ConKI adopts the learning regime of pretraining followed by fine-tuning. We first pretrain the adapters in ConKI with L_task using an external dataset while fixing the model parameters of the pretrained backbones, since the adapters have far fewer trainable parameters than the backbones. Then we fine-tune ConKI on the downstream target dataset by optimizing the overall loss L:

L = L_task + λ L_con,

where λ is a hyperparameter that balances the MSA task loss and the hierarchical contrastive loss. Algorithm 1 shows the full training procedure of ConKI.
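The two-term fine-tuning objective can be sketched numerically as follows. The λ value matches the one reported in Appendix B for CMU-MOSI/SIMS; the contrastive loss value and the predictions are placeholders:

```python
import torch

def msa_loss(y_hat, y):
    """Main-task loss: mean squared error over the batch."""
    return ((y_hat - y) ** 2).mean()

# Overall fine-tuning objective: L = L_task + lambda * L_con.
lam = 0.01                           # lambda for CMU-MOSI / SIMS (Appendix B)
y = torch.tensor([1.5, -2.0])        # ground-truth sentiment scores
y_hat = torch.tensor([1.0, -1.0])    # model predictions (placeholder)
l_con = torch.tensor(0.8)            # placeholder contrastive-loss value
total = msa_loss(y_hat, y) + lam * l_con   # 0.625 + 0.008 = 0.633
```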

Experiments
In this section, we present the experimental setup, including datasets, evaluation metrics, and baseline models, as well as the experimental results. The implementation details are given in Appendix B.

Datasets and Metrics
We conduct experiments on three publicly available benchmark datasets for MSA: CMU-MOSI (Zadeh et al., 2016), CMU-MOSEI (Zadeh et al., 2018), and SIMS (Yu et al., 2020). Table 1 shows the statistics of the datasets. Appendix C describes the details of these datasets.
Following previous works (Sun et al., 2020; Rahman et al., 2020; Hazarika et al., 2020; Yu et al., 2021; Mai et al., 2021; Han et al., 2021; Yu et al., 2020), we report our experimental results in two forms: regression and classification. For regression, we report the mean absolute error (MAE) and Pearson correlation (Corr). For classification, we report binary classification accuracy (Acc-2) and F1 score. Specifically, for the CMU-MOSI and CMU-MOSEI datasets, we calculate Acc-2 and F1 scores under both the negative/positive (zero excluded) and non-negative/positive (zero included) settings, as well as seven-class classification accuracy (Acc-7), which is the percentage of predictions that fall into the same one of the seven intervals between −3 and +3 as the ground truth. Higher values indicate better performance for all metrics except MAE.
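For instance, Acc-7 can be computed by rounding both predictions and labels into the seven integer intervals and counting the matches. A small self-contained sketch, with illustrative numbers rather than real model outputs:

```python
# Acc-7 sketch: scores in [-3, 3] are rounded (and clipped) to the nearest
# integer, giving seven classes; Acc-7 is the fraction of predictions that
# land in the same class as the label.
def acc7(preds, labels):
    def clip_round(x):
        return max(-3, min(3, round(x)))
    hits = sum(clip_round(p) == clip_round(l) for p, l in zip(preds, labels))
    return hits / len(preds)

preds  = [2.6, -0.4, 1.2, -2.9]   # illustrative predictions
labels = [3.0, -1.0, 1.0, -3.0]   # illustrative labels
print(acc7(preds, labels))        # 0.75: three of four match intervals
```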

Results
In accordance with previous work, we run our model five times under the same hyperparameter settings and report the average performance on all metrics in Table 2 and Table 3. It is notable that the MAE of ConKI on CMU-MOSI outperforms that of the best baseline model, MMIM, by around 0.02, which shows that ConKI is able to learn effective representations for the MSA task, since MAE is the most commonly used evaluation metric in regression tasks. ConKI also presents an excellent performance in the Corr scores on both the CMU-MOSI and CMU-MOSEI datasets. A possible reason for this excellent performance is that ConKI uses contrastive learning to recognize the samples under different sentiments, which could lead to effective ranking among samples and thus produce a higher Corr score (Swinscow et al., 2002).
Furthermore, the Acc-7 of ConKI on CMU-MOSI surpasses the best baseline by 1.78. Though performing classification, especially seven-class classification, is difficult in a regression task, ConKI successfully leverages the contrasts across samples that are classified into seven intervals (by the round function described in Section 3.4) to model the sample dynamics, which brings a great improvement in Acc-7 and Acc-2, demonstrating the efficacy of ConKI in representation learning for MSA. In addition, ConKI shows excellent F1 scores on all datasets, which endorses its potential in real-world applications, since F1 is valuable for evaluating imbalanced datasets.

Ablation Study
We first conduct an ablation study on modalities, as shown in Table 4. We can observe that the inclusion of all three modalities significantly improves the performance of ConKI.
To show the benefits of the proposed knowledge injection and hierarchical contrastive learning in ConKI, we conduct a series of ablation experiments on CMU-MOSI, as shown in Table 5 and Table 6. ConKI mainly includes four components: the use of the external dataset (C1), adapters for knowledge injection (C2), pretrained encoders for pan-knowledge (C3), and hierarchical contrastive learning (C4). Since the core of our hierarchical contrastive learning is the contrasts between knowledge types, we also compare our model with the model w/o N_1, trained with L_con but without the negative pairs N_1, i.e., without disentangling the pan knowledge and specific knowledge. We can conclude that learning differentiated pan-knowledge and knowledge-specific representations is essential in our hierarchical contrastive learning. To better understand the pan- and specific-knowledge representations learned by our hierarchical contrastive learning, we visualize and analyze these representations in Appendix A.2.
To further examine whether our performance gain comes from the external dataset rather than the proposed knowledge injection and contrastive learning techniques, we compare our model with the state-of-the-art baseline models fine-tuned with the external dataset. The results show that ConKI still outperforms these models even though they are trained with the external dataset. Therefore, our gain is not solely from adding more data, but from knowledge injection with multi-step transfer learning. Considering that CMU-MOSEI is much larger than CMU-MOSI, injecting CMU-MOSEI's knowledge into CMU-MOSI has more effect on the downstream task than injecting CMU-MOSI into CMU-MOSEI, as shown in Table 2.

Conclusion
In this paper, we present ConKI, a Contrastive Knowledge Injection framework for multimodal sentiment analysis, which learns knowledge-specific representations along with pan-knowledge representations via knowledge injection and hierarchical contrastive learning. ConKI utilizes pretrained encoders to obtain pan-knowledge representations while generating knowledge-specific representations based on injected adapters that are trained on an external knowledge source. With the specific knowledge, ConKI is able to produce more accurate sentiment predictions than solely using the pan-knowledge representations. To further improve the learning of these representations, we specifically design a hierarchical contrastive learning procedure taking into account the contrasts between knowledge types within each modality, across modalities within one sample, and across samples. Experimental results on three benchmark datasets show that ConKI outperforms all state-of-the-art methods on a range of performance metrics.

Limitations
Our research presents an initial step toward a knowledge injection framework for MSA and still has some limitations to be tackled in the future. Firstly, more disentangled representations could be learned by more carefully selecting contrastive pairs. Secondly, it would be interesting to extend our method with multiple external sources that come from different knowledge domains.

A.1 Detailed Pairing Policy

Figure 4: Pairing example of three samples, where sample 1 and sample 2 are in the same sentiment interval while sample 3 is in a different sentiment interval. Grey cells with "1" stand for the positive pairs, and white cells with "0" represent the negative pairs.

We illustrate an example batch in Figure 4, consisting of three samples where sample 1 and sample 2 belong to the same sentiment interval while sample 3 falls in a different sentiment interval. In this figure, the "1"s in the heatmap represent the positive pairs of the row vectors and column vectors, and the "0"s represent the negative pairs.
• From this figure, we can read off the intra-sample positive pairs as the "1"s with red borders;
• We represent the inter-knowledge negative pairs N_1 as the "0"s in the blue zone;
• Since sample 1 and sample 2 have close sentiment scores, we form the inter-sample positive pairs as the "1"s without red borders;
• The remaining white cells with "0" show the negative pairs in N_2, which aim to push sample 3 away from sample 1 and sample 2 because they have different sentiments.

A.2 Visualization of Modality Representations
The motivation for introducing hierarchical contrastive learning into ConKI is that representations of different modalities should be close to each other within one sample and across samples in the same sentiment interval, and far apart across samples in different intervals, while the two knowledge types contained in one modality should also differ from each other. We use t-SNE (Van der Maaten and Hinton, 2008) to visualize the distributions of the six representations learned by ConKI with and without hierarchical contrastive learning, as shown in Figure 5.
Though we divide all samples into seven intervals to perform contrastive learning, for simplicity of visualization we take samples from two intervals in the test set to show the learned representations before and after contrastive learning. From Figure 5 (a), we can easily observe that some of the representations of samples in the two different intervals, such as the pan-knowledge representations in light blue and the knowledge-specific representations in dark green, overlap heavily with each other.
In contrast, these overlapping representations are pushed further apart in Figure 5 (b) due to the sample-level contrasts. It is also evident that the three knowledge-specific representations of samples in the same interval, e.g., A_t, A_v, A_a of Interval 2 in dark colors and star shape, become closer because of both the modality-level and sample-level contrasts. Moreover, the distance between the knowledge-specific representations and the pan-knowledge representations, e.g., A_v in dark green and O_v in light green of Interval 2, becomes larger in Figure 5 (b) due to the knowledge-level contrasts. All of these indicate that ConKI performs the desired contrastive learning and learns better representations that help improve performance, even in the generalized scenario, i.e., on the test set.

B Implementation Details
We use unaligned raw data in all experiments, as in previous works (Yu et al., 2021; Han et al., 2021), for fair comparisons. For the audio and video modalities, two commonly used toolkits (COVAREP (Degottex et al., 2014) and the Facial Action Coding System (FACS) (Ekman and Rosenberg, 2005)) act as the feature extractors, respectively. We use the uncased 12-layer pretrained BERT model (https://huggingface.co/bert-base-uncased) as the text encoder, and two 2-layer transformers as the video and audio encoders, respectively. Adapter_t has three modules inserted before the first, sixth, and eleventh layers of BERT sequentially. Adapter_v and Adapter_a have one module inserted before the second layer of the corresponding encoder. We use CMU-MOSEI as the external dataset for CMU-MOSI and SIMS, while using CMU-MOSI for CMU-MOSEI. During pretraining, the learning rate is set to 5e-5 and we train for 10 epochs, with one epoch of linear warm-up. During fine-tuning on CMU-MOSI and SIMS, the learning rates for the encoders and the other components are set to 5e-6 and 1e-6, respectively, with a decay of 0.001. The temperature parameter τ is set to 0.07 and λ is set to 0.01 after grid search. We fine-tune for 200 epochs with batch size 32. For fine-tuning on CMU-MOSEI, the learning rates for the text encoder and the other components are 5e-6 and 5e-5, respectively, and λ is set to 0.001. The best-performing checkpoint on the validation set is used for testing. We implement our experiments using PyTorch (Paszke et al., 2019) on an Nvidia RTX 2080Ti GPU.

C Datasets
CMU-MOSI is a popular benchmark dataset collected from YouTube. It contains 2,199 video clips sliced from 93 videos in which a speaker shares opinions on topics such as movies. Each video is annotated with sentiment scores ranging from −3 (strongly negative) to +3 (strongly positive). CMU-MOSEI is the largest MSA dataset and has greater diversity in speakers, topics, and annotations. It contains 22,856 annotated video segments from 1,000 distinct speakers and 250 topics. Each clip also has sentiment scores in [−3, +3]. SIMS is a Chinese MSA dataset that contains 2,281 refined video segments. Each sample has one multimodal label and three unimodal labels, with sentiment scores from −1 to +1. We translate the Chinese text into English so that we can inject knowledge from English MSA datasets into SIMS. For fair comparisons, all baseline models use the English version to evaluate the performance.

D Baseline Models
TFN. The Tensor Fusion Network (TFN) (Zadeh et al., 2017) encodes the three modalities with corresponding embedding subnetworks and uses the outer product to model the unimodal, bimodal, and trimodal interactions as the fusion results.
LMF. The Low-rank Multimodal Fusion (LMF) (Liu and Shen, 2018) utilizes low-rank tensors to improve the efficiency of multimodal fusion.
MulT. The Multimodal Transformer (MulT) (Tsai et al., 2019) proposes directional pairwise cross-modal attention that adapts one modality into another for multimodal fusion.
ICCN. The Interaction Canonical Correlation Network (ICCN) (Sun et al., 2020) learns text-based audio and text-based video features by optimizing a canonical loss. These features are concatenated with the text features for downstream classifiers such as logistic regression.
MISA. The Modality-Invariant and -Specific Representations (MISA) (Hazarika et al., 2020) model designs a multitask loss, including a task prediction loss, a reconstruction loss, a similarity loss, and a difference loss, to learn modality-invariant and modality-specific representations.
MAG-BERT. The Multimodal Adaptation Gate for BERT (MAG-BERT) (Rahman et al., 2020) builds an alignment gate that allows audio and video information to leak into the BERT model for multimodal fusion.
Self-MM. The Self-Supervised Multitask Learning (Self-MM) (Yu et al., 2021) model proposes a label generation module based on self-supervised learning to obtain unimodal supervision, and then jointly trains the multimodal and unimodal tasks for better fusion results.
HyCon. The Hybrid Contrastive Learning (HyCon) (Mai et al., 2021) framework performs intra- and inter-modal contrastive learning as well as semi-contrastive learning within a modality to explore cross-modal interactions.
MMIM. MultiModal InfoMax (MMIM) (Han et al., 2021) maximizes the mutual information in unimodal input pairs, as well as between the multimodal fusion result and the unimodal input, to aid the main MSA task.

Figure 1: The overall architecture of ConKI. The solid and dashed arrows represent the procedure of the main MSA task and the hierarchical contrastive learning subtask, respectively. Inside the contrastive learning procedure, cyan and pink boxes illustrate samples that fall in different sentiment score intervals.

Figure 2: The fusion network. The fusion module marked in the dashed box is used to obtain the weighted fused embedding; the product symbol denotes element-wise multiplication.

Figure 3: Adapter and its connection with the backbone encoder.

Figure 5: The visualization of the six decomposed representations of samples in the same sentiment interval and in different intervals for (a) w/o h-CL; (b) ConKI. In each subfigure, light yellow, light blue, and light green represent the pan-knowledge representations of the text, audio, and video modalities, respectively, while dark yellow, dark blue, and dark green represent the corresponding knowledge-specific representations. Each point or star stands for a sample in Interval 1 or Interval 2.
