UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning

Multimodal learning aims to imitate how humans acquire complementary information from multiple modalities for various downstream tasks. However, traditional aggregation-based multimodal fusion methods ignore the inter-modality relationship, treat each modality equally, and suffer from sensor noise, which reduces multimodal learning performance. In this work, we propose a novel multimodal contrastive method that learns more reliable multimodal representations under the weak supervision of unimodal predictions. Specifically, we first capture task-related unimodal representations and unimodal predictions from the introduced unimodal prediction tasks. Then the unimodal representations are aligned with the more effective modality by the designed multimodal contrastive method under the supervision of the unimodal predictions. Experimental results with fused features on two image-text classification benchmarks, UPMC-Food-101 and N24News, show that our proposed Unimodality-Supervised MultiModal Contrastive learning method (UniS-MMC) outperforms current state-of-the-art multimodal methods. A detailed ablation study and analysis further demonstrate the advantages of our proposed method.


Introduction
Social media has emerged as an important avenue for communication. The content is often multimodal, e.g., text, speech, audio, and video. Multimodal tasks that employ multiple data sources include image-text classification and emotion recognition, which support applications in daily life such as web search (Chang et al., 2022) and guide robots (Moon and Seo, 2019). Hence, there is a need for an effective representation strategy for multimodal content. A common way is to fuse unimodal representations. As illustrated in Figure 1, the unimodal representation of a single modality can be either effective or not, and the effectiveness of different unimodal representations from the same sample also varies. To empower the interaction between modalities, our proposed method aligns the unimodal representations to the effective modality sample-wise and makes full use of the effective unimodal representation under the supervision of the unimodal predictions.

Despite the recent progress in obtaining effective unimodal representations from large pre-trained models (Devlin et al., 2019; Liu et al., 2019; Dosovitskiy et al., 2021), fusing them into more trustworthy and complementary multimodal representations remains a challenging problem in multimodal learning.
To solve the multimodal fusion problem, researchers have proposed aggregation-based fusion methods that combine unimodal representations. These methods include aggregating unimodal features (Castellano et al., 2008; Nagrani et al., 2021), aggregating unimodal decisions (Ramirez et al., 2011; Tian et al., 2020a), and aggregating both (Wu et al., 2022). However, these aggregation-based methods ignore the relationship between modalities, which affects the performance of multimodal tasks (Udandarao et al., 2020). To address this issue, alignment-based fusion methods have been introduced to strengthen the inter-modality relationship by aligning the embeddings of different modalities. Existing alignment-based methods can be divided into two categories: architecture-based and contrastive-based. Architecture-based methods introduce a specific module for mapping features to the same space (Wang et al., 2016) or design an adaptation module before minimizing the spatial distance between source and auxiliary modal distributions (Song et al., 2020). Contrastive-based methods, on the other hand, efficiently align different modality representations through contrastive learning on paired modalities (Liu et al., 2021b; Zolfaghari et al., 2021; Mai et al., 2022).
Unsupervised multimodal contrastive methods directly regard modality pairs from the same sample as positive pairs and modality pairs from different samples as negative pairs, pulling together the unimodal representations of paired modalities and pushing apart those of unpaired modalities in the embedding space (Tian et al., 2020b; Akbari et al., 2021; Zolfaghari et al., 2021; Liu et al., 2021b; Zhang et al., 2021a; Taleb et al., 2022). Supervised multimodal contrastive methods instead treat sample pairs with the same label as positive pairs and sample pairs with different labels as negative pairs within the mini-batch (Zhang et al., 2021b; Pinitas et al., 2022). In this way, unimodal representations with the same semantics are clustered together.
Despite their effectiveness in learning the correspondence among modalities, these contrastive-based multimodal learning methods still struggle with sensor noise in in-the-wild datasets (Mittal et al., 2020). Current methods treat each modality equally and ignore the different roles that different modalities play; the final decision is negatively affected by samples with ineffective unimodal representations, and the resulting multimodal representations are therefore not trustworthy. In this work, we aim to learn trustworthy multimodal representations by aligning unimodal representations towards the effective modality, considering modality effectiveness in addition to strengthening the relationships between modalities. Modality effectiveness is decided by the unimodal prediction, and the contrastive learning is performed under the weak supervision of the unimodal predictions. As shown in Figure 1, the unimodal representations are aligned towards those with correct unimodal predictions. In summary, our contributions are:
• To facilitate the inter-modality relationship for multimodal classification, we combine aggregation-based and alignment-based fusion methods to create a joint representation.
• We propose UniS-MMC to efficiently align unimodal representations to the effective modality under the weak supervision of unimodal predictions, addressing the issue of different contributions from different modalities.
• Extensive experiments on two image-text classification benchmarks, UPMC-Food-101 (Wang et al., 2015) and N24News (Wang et al., 2022), demonstrate the effectiveness of our proposed method.

Related Work
In this section, we introduce related work on contrastive learning and multimodal learning.

Contrastive Learning
Contrastive learning (Hadsell et al., 2006; Oord et al., 2018; Qin and Joty, 2022) captures distinguishable representations by drawing positive pairs closer and pushing negative pairs farther apart. In addition to such single-modality representation learning, contrastive methods for multiple modalities have also been widely explored. Common methods (Radford et al., 2021; Jia et al., 2021; Kamath et al., 2021; Li et al., 2021; Zhang et al., 2022; Taleb et al., 2022; Chen et al., 2022) leverage cross-modal contrastive matching to align two different modalities and learn the inter-modality correspondence. Beyond inter-modality contrast, Visual-Semantic Contrastive (Yuan et al., 2021), XMC-GAN (Zhang et al., 2021a) and CrossPoint (Afham et al., 2022) also introduce intra-modality contrast for representation learning. Besides, CrossCLR (Zolfaghari et al., 2021) removes highly related samples from the negative set to avoid the bias of false negatives. GMC (Poklukar et al., 2022) builds the contrastive learning process between modality-specific representations and the global representation of all modalities instead of between cross-modal representations.

Multimodal Learning
Multimodal learning aims to build models based on multiple modalities and to improve overall performance through joint representations (Ngiam et al., 2011; Baltrušaitis et al., 2018; Gao et al., 2020). The fusion operation among multiple modalities is one of the key topics in multimodal learning, helping the modalities complement each other (Wang, 2021). Multimodal fusion methods are generally categorized into two types: alignment-based and aggregation-based fusion (Baltrušaitis et al., 2018). Alignment-based fusion (Gretton et al., 2012; Song et al., 2020) aligns representations across modalities, whereas aggregation-based fusion combines unimodal features or decisions into a joint representation. In addition to these joint-representation generating methods, some works further propose to evaluate the attended modalities and features before fusing. M3ER (Mittal et al., 2020) conducts a modality check step to find modalities with small correlation, and Multimodal Dynamics (Han et al., 2022) evaluates both feature- and modality-level informativeness while extracting unimodal representations.

Methodology
In this section, we present our method called UniS-MMC for multimodal fusion.

Notation
Suppose we have a training dataset with paired modalities $a$ and $b$. The unimodal representations $r_a$ and $r_b$ of modalities $a$ and $b$ are extracted by the respective encoders $f_{\theta_a}$ and $f_{\theta_b}$. Following the parameter sharing strategy in multi-task learning (Pilault et al., 2021; Bhattacharjee et al., 2022), these representations are shared directly between the unimodal prediction tasks and the subsequent multimodal prediction task. With weak supervision produced by the respective unimodal classifiers $g_{\phi_a}$ and $g_{\phi_b}$, the final prediction is learned from the fused multimodal representation $r_c$ with the multimodal classifier $g_{\phi_c}$.

Unimodality-supervised Multimodal Contrastive Learning
First, the unimodal representations are extracted from the raw data of each modality by the pretrained encoders. We then introduce a unimodality check step that generates weak supervision by checking the effectiveness of each unimodal representation. Finally, we describe how we design the unimodality-supervised multimodal contrastive learning method over multiple modalities to learn the multimodal representations.

Modality Encoder
Given multimodal training data $\{x_m\}_{m=1}^{M}$, the raw unimodal data of modality $m$ are first processed by the respective encoder to obtain a hidden representation. We denote the learned hidden representation $f_{\theta_m}(x_m)$ of modality $m$ as $r_m$. We use the pretrained ViT (Dosovitskiy et al., 2021) as the image encoder for both UPMC-Food-101 and N24News. We use the pretrained BERT (Devlin et al., 2019) as the text encoder for the textual descriptions in these datasets, and additionally try the pretrained RoBERTa (Liu et al., 2019) for the text sources in N24News.
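For illustration, a minimal sketch of this encoding step with the HuggingFace Transformers library is given below. The checkpoint names and the use of the [CLS] token as the pooled unimodal representation are assumptions for illustration, not necessarily the authors' exact configuration.

```python
# Sketch (not the authors' code): obtaining unimodal representations r_m with
# pretrained BERT and ViT encoders from HuggingFace Transformers.
from transformers import BertTokenizer, BertModel, ViTImageProcessor, ViTModel

text_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def encode_text(sentences):
    # r_text: [CLS] embedding used here as the text representation (one common choice).
    inputs = text_tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return text_encoder(**inputs).last_hidden_state[:, 0]   # (batch, 768)

def encode_image(pil_images):
    # r_image: [CLS] token of the ViT output, used here as the image representation.
    inputs = image_processor(images=pil_images, return_tensors="pt")
    return image_encoder(**inputs).last_hidden_state[:, 0]  # (batch, 768)
```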

Unimodality Check
Unimodal prediction. Unlike common aggregation-based multimodal learning methods, which only use the learned unimodal representations for fusion, our method also uses the unimodal representations as inputs to unimodal prediction tasks. The classification module can be regarded as a probabilistic model $g_\phi: \mathcal{R} \rightarrow \mathcal{P}$ that maps a hidden representation to a predictive distribution $p(y \mid r)$. For a unimodal prediction task, the predictive distribution is based only on the output of the unimodal classifier. The learning objective of the unimodal prediction task is to minimize each unimodal prediction loss, which under the standard cross-entropy formulation can be written as
$$\mathcal{L}_{uni}^{m} = -\sum_{k=1}^{K} y_k \log p_m(y_k \mid r_m),$$
where $y_k$ is the $k$-th element of the category label and $p_m(y_k \mid r_m)$ is the corresponding predicted probability for modality $m$.
Unimodality effectiveness. The above unimodal prediction results are used as supervision for deciding the effectiveness of each modality. A unimodal representation with a correct prediction is regarded as effective, i.e., providing information about the target label. Conversely, a unimodal representation with a wrong prediction is regarded as ineffective.
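A minimal sketch of this check, assuming linear unimodal classifiers and integer class labels, could look as follows; the names and dimensions are illustrative.

```python
# Sketch of the unimodality check: unimodal classifiers, their cross-entropy
# losses, and a per-sample effectiveness indicator (correct prediction or not).
import torch
import torch.nn as nn
import torch.nn.functional as F

d, num_classes = 768, 101
clf_a = nn.Linear(d, num_classes)   # unimodal classifier g_phi_a (illustrative)
clf_b = nn.Linear(d, num_classes)   # unimodal classifier g_phi_b (illustrative)

def unimodality_check(r_a, r_b, y):
    logits_a, logits_b = clf_a(r_a), clf_b(r_b)
    # Unimodal prediction losses, one per modality.
    loss_uni = F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)
    # Effectiveness: a modality is "effective" on a sample if its prediction is correct.
    correct_a = logits_a.argmax(dim=-1).eq(y)   # (batch,) bool
    correct_b = logits_b.argmax(dim=-1).eq(y)
    return loss_uni, correct_a, correct_b
```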

Multimodal Contrastive Learning
We aim to reduce the multimodal prediction bias caused by treating modalities equally for each sample. When considering two specific modalities $m_a$ and $m_b$ of the $n$-th sample, we generate two unimodal hidden representations $r_a^n$ and $r_b^n$ from the respective unimodal encoders. From the unimodal prediction step above, we also obtain the unimodal prediction results $\hat{y}_a^n$ and $\hat{y}_b^n$. As summarized in Table 1, we define positive, negative and semi-positive pairs as follows.
Positive pair. If both paired unimodal predictions are correct, we define the unimodal representation pair as a positive pair, giving the set $P = \{n \mid \hat{y}_a^n = y^n \text{ and } \hat{y}_b^n = y^n\}$ in the mini-batch $B$.
Negative pair. If both paired unimodal predictions are wrong, we define the pair as a negative pair, giving $N = \{n \mid \hat{y}_a^n \neq y^n \text{ and } \hat{y}_b^n \neq y^n\}$.
Semi-positive pair. If the predictions of the paired unimodal representations are mutually exclusive, one correct and the other wrong, we define the pair as a semi-positive pair, giving $S = \{n \mid \hat{y}_a^n = y^n, \hat{y}_b^n \neq y^n\} \cup \{n \mid \hat{y}_a^n \neq y^n, \hat{y}_b^n = y^n\}$.
We then propose the multimodal contrastive loss for two modalities as
$$\mathcal{L}_{mmc}^{(a,b)} = -\log \frac{\sum_{n \in P \cup S} \exp\!\left(\cos(r_a^n, r_b^n)/\tau\right)}{\sum_{n \in P \cup S \cup N} \exp\!\left(\cos(r_a^n, r_b^n)/\tau\right)},$$
where $\tau$ is the temperature coefficient. The similarity of positive and semi-positive pairs is optimized towards a higher value, while the similarity of negative pairs is optimized towards a smaller value. The difference between positive and semi-positive pairs is that in positive pairs the two unimodal representations are updated towards each other, whereas in semi-positive pairs only the unimodal representation with the wrong prediction is updated towards the correct one. Concretely, we detach the modality feature with the correct prediction from the computation graph when aligning it with the low-quality modality feature in a semi-positive pair, inspired by GAN training (Arjovsky et al., 2017; Zhu et al., 2017), where the generator output is detached when updating only the discriminator. Multimodal problems often involve more than two modalities. For $M > 2$ modalities, the multimodal contrastive loss is obtained by summing the pairwise losses over all modality pairs:
$$\mathcal{L}_{mmc} = \sum_{a=1}^{M-1} \sum_{b=a+1}^{M} \mathcal{L}_{mmc}^{(a,b)}.$$
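To make the pair handling concrete, the sketch below shows one plausible PyTorch implementation of this two-modality loss. The exact published formulation may differ; the variable names, the default temperature, and the logsumexp-based masking are illustrative assumptions, while the detach of the correct modality in semi-positive pairs mirrors the description above.

```python
# Minimal sketch of the unimodality-supervised contrastive loss for two modalities.
import torch
import torch.nn.functional as F

def unis_mmc_loss(r_a, r_b, correct_a, correct_b, tau=0.1):
    """r_a, r_b: (batch, d) unimodal representations;
    correct_a, correct_b: (batch,) bool, whether each unimodal prediction is correct."""
    # Semi-positive pairs: stop gradients through the effective (correct) modality,
    # so only the ineffective representation is pulled towards it.
    a = torch.where((correct_a & ~correct_b).unsqueeze(-1), r_a.detach(), r_a)
    b = torch.where((correct_b & ~correct_a).unsqueeze(-1), r_b.detach(), r_b)

    sim = F.cosine_similarity(a, b, dim=-1) / tau        # (batch,) pairwise similarities
    pos_or_semi = (correct_a | correct_b).float()        # 1 for P and S pairs, 0 for N pairs

    # -log( sum_{P,S} exp(sim) / sum_{all} exp(sim) ):
    # positive/semi-positive similarities are pushed up, negative ones down.
    log_all = torch.logsumexp(sim, dim=0)
    log_pos = torch.logsumexp(sim + torch.log(pos_or_semi + 1e-12), dim=0)
    return log_all - log_pos
```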

Fusion and Total Learning Objective
Multimodal prediction. When fusing all unimodal representations by concatenation, we obtain the fused multimodal representation $r_c = r_1 \oplus r_2 \oplus \cdots \oplus r_M$. Similarly, the multimodal predictive distribution is the output of the multimodal classifier given the fused representation as input.
For the multimodal prediction task, the target is to minimize the multimodal prediction loss, again a cross-entropy:
$$\mathcal{L}_{multi} = -\sum_{k=1}^{K} y_k \log p_c(y_k \mid r_c),$$
where $y_k$ is the $k$-th element of the category label and $p_c(y_k \mid r_c)$ is the predicted probability from the multimodal classifier.
Total learning objective. The overall optimization objective of our proposed UniS-MMC combines the prediction losses with the multimodal contrastive loss:
$$\mathcal{L} = \mathcal{L}_{multi} + \sum_{m=1}^{M} \mathcal{L}_{uni}^{m} + \lambda\, \mathcal{L}_{mmc},$$
where $\lambda$ is a loss coefficient balancing the prediction losses and the multimodal contrastive loss.
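The following sketch ties these pieces together under the assumptions of the previous sketches (it reuses the hypothetical unimodality_check and unis_mmc_loss helpers defined there); the hidden sizes and the value of the loss coefficient are illustrative, not the authors' settings.

```python
# Sketch of concatenation fusion and the total training objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, num_classes, lambda_mmc = 768, 101, 0.1
clf_multi = nn.Sequential(                     # multimodal classifier g_phi_c (illustrative MLP)
    nn.Linear(2 * d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, num_classes),
)

def total_loss(r_a, r_b, y):
    # Unimodal prediction losses and per-sample effectiveness (see earlier sketch).
    loss_uni, correct_a, correct_b = unimodality_check(r_a, r_b, y)
    r_c = torch.cat([r_a, r_b], dim=-1)        # fused representation r_c = r_a (+) r_b
    loss_multi = F.cross_entropy(clf_multi(r_c), y)
    loss_mmc = unis_mmc_loss(r_a, r_b, correct_a, correct_b)
    return loss_multi + loss_uni + lambda_mmc * loss_mmc
```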

Experimental Setup
Dataset and metric. We evaluate our method on two publicly available image-text classification datasets, UPMC-Food-101 and N24News. UPMC-Food-101 is a multimodal classification dataset that contains textual recipe descriptions and corresponding images for 101 kinds of food. We obtain this dataset from the project website and split 5,000 samples from the default training set as the validation set. N24News is a news classification dataset with four text types (Heading, Caption, Abstract and Body) and images. To complement the long text data of the Food-101 dataset, we use the first three text sources of N24News in our work. We use classification accuracy (Acc) as the evaluation metric for both UPMC-Food-101 and N24News. Detailed dataset information is given in Appendix A.1.
Implementation. For the image-text dataset UPMC-Food-101, we use pretrained BERT (Devlin et al., 2019) as the text encoder and a pretrained vision transformer (ViT) (Dosovitskiy et al., 2021) as the image encoder. For N24News, we utilize two different pretrained language models, BERT and RoBERTa (Liu et al., 2019), as text encoders, together with the same vision transformer as the image encoder. All classifiers for these two image-text classification datasets are three fully-connected layers with ReLU activation functions.
The default reported results on the image-text datasets are obtained with BERT-base (or RoBERTa-base) and ViT-base. Performance is reported as the average and standard deviation of three runs on Food101 and N24News. The code is available on GitHub. The detailed hyper-parameter settings are summarized in Appendix A.2.

Baseline Models
The baseline models used are as follows:
• MMBT (Kiela et al., 2019) jointly finetunes pretrained text and image encoders by projecting image embeddings into the text token space of a BERT-like architecture.
• HUSE (Narayana et al., 2019) creates a joint representation space by learning cross-modal representations with semantic information.
• ME (Liang et al., 2022) is the state-of-the-art method on Food101, which performs cross-modal feature transformation to leverage cross-modal information.
• N24News (Wang et al., 2022) trains both the unimodal and multimodal prediction tasks to capture modality-invariant representations.
• AggMM finetunes the pretrained text and image encoders and concatenates the unimodal representations for the multimodal recognition task.
• SupMMC and UnSupMMC finetune the pretrained text and image encoders and then utilize the supervised and unsupervised multimodal contrastive method to align unimodal representations before creating joint embeddings, respectively.

Performance Comparison
Final classification performance comparison.
The final image-text classification performance on Food101 and N24News is presented in Table 2. We have the following findings from the experimental results: (i) among the implemented methods, contrastive-based methods with naive alignment improve over the implemented aggregation-based methods; (ii) the implemented contrastive-based methods outperform many recent multimodal methods; (iii) the proposed UniS-MMC achieves a large improvement over both the implemented contrastive-based baselines and recent state-of-the-art multimodal methods on Food101, and produces the best results for every text source on N24News with the same encoders.
t-SNE visualization comparison with baseline models. We visualize the representation distributions of the proposed unimodality-supervised multimodal contrastive method and compare them with those of the naive aggregation-based method and the typical unsupervised and supervised contrastive methods.
As shown in Figure 4, the unimodal representations are mapped into the same feature space. Typical contrastive methods, such as the unsupervised and supervised contrastive methods, mix up unimodal representations from different categories when bringing the representations of different modalities that share the same semantics closer. For example, the representations of the two modalities from the same category are clustered well in Figure 4 (b) and (c) (green circle and orange circle). However, these contrastive-based methods also bring two problems. One is that mapping the unimodal embeddings into the same embedding space loses the complementary information from different modalities. The other is that they heavily mix the representations of a specific class with other categories, as in the clusters marked by the orange circle. In comparison, our proposed method preserves the complementary multimodal information by maintaining the two parts of the distribution from the two modalities (red line) well (Figure 4 (d)), as the aggregation-based method does (Figure 4 (a)), while additionally producing better clusters of unimodal representations; this distinguishes it from the other two typical contrastive-based methods. We further summarize the visualization of the final multimodal representations in Figure 5. Comparing Figure 5 (a) and Figure 5 (d), our proposed method not only helps the unimodal representation learning process and obtains better sub-clusters for each modality, but also improves the classification boundary of the final multimodal representation.
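As a rough illustration of how such plots can be produced, the sketch below projects precomputed text and image representations of the test set into a shared 2-D space with t-SNE; the feature extraction itself and the plotting details are assumptions for illustration.

```python
# Sketch: joint t-SNE projection of unimodal (text and image) representations.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_unimodal_tsne(r_text, r_image, labels):
    """r_text, r_image: (N, d) feature arrays; labels: (N,) category ids."""
    feats = np.concatenate([r_text, r_image], axis=0)
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    n = len(labels)
    plt.scatter(coords[:n, 0], coords[:n, 1], c=labels, marker="o", s=8, label="text")
    plt.scatter(coords[n:, 0], coords[n:, 1], c=labels, marker="^", s=8, label="image")
    plt.legend()
    plt.show()
```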

Analysis
Classification with Different Combinations of Input Modalities.We first perform an ablation study of classification on N24News with different input modalities.MMC achieves 0.6% to 2.4% improvement over the aggregation-based baseline model with BERT and 0.3% to 1.4% improvement with RoBERTa.
Ablation study on N24News. We conduct an ablation study to analyze the contributions of the different components of the proposed UniS-MMC on N24News. AggMM is the aggregation-based baseline model that combines the unimodal representations directly. The ablation covers three text sources, headline, caption and abstract, with both BERT-based and RoBERTa-based models. Specifically, $\mathcal{L}_{uni}$ denotes the introduced unimodal prediction task, and $C_{Semi}$ and $C_{Neg}$ denote the semi-positive pair and negative pair settings, respectively.
Table 4 presents the multimodal classification results of the above ablation study with different participating components. $\mathcal{L}_{uni}$ and the $C_{Semi}$ setting both align the unimodal representations towards the targets, the former by mapping different unimodal representations to the same target space and the latter by aligning feature distributions. Both provide a significant improvement over the baseline model. $C_{Neg}$ further improves performance by forming a more complementary combination of multimodal representations for samples that are difficult to classify.
Analysis on the learning process. To further explore the role of our proposed UniS-MMC in aligning the unimodal representations towards the targets, we summarize the unimodal prediction results over the learning process.

Analysis on the Final Multimodal Decision.
Compared with the proposed UniS-MMC, MT-MML jointly trains the unimodal and multimodal prediction tasks without applying the proposed multimodal contrastive loss. We summarize the unimodal performance of MT-MML and UniS-MMC and present the unimodal predictions in Figure 6. Unimodal prediction consistency here refers to the consistency of the unimodal predictions for each sample. When focusing on the classification details of each modality pair, we find that the proposed UniS-MMC yields a larger proportion of samples with both unimodal predictions correct, and smaller proportions of samples with both predictions wrong or with opposite unimodal decisions, compared with MT-MML.
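For clarity, the small sketch below shows one way the per-sample consistency categories (both correct, both wrong, opposite) could be computed from the unimodal correctness indicators; the input format is an assumption.

```python
# Illustrative computation of per-sample prediction-consistency proportions.
import numpy as np

def consistency_stats(correct_text, correct_image):
    correct_text = np.asarray(correct_text, dtype=bool)
    correct_image = np.asarray(correct_image, dtype=bool)
    return {
        "both_correct": np.mean(correct_text & correct_image),
        "both_wrong": np.mean(~correct_text & ~correct_image),
        "opposite": np.mean(correct_text ^ correct_image),
    }
```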

Conclusion
In this work, we propose Unimodality-Supervised Multimodal Contrastive learning (UniS-MMC), a novel method for multimodal fusion that reduces the multimodal decision bias caused by inconsistent unimodal information. Based on the introduced multi-task multimodal learning framework, we capture task-related unimodal representations and evaluate their potential influence on the final decision with the unimodal predictions. We then contrastively align the unimodal representations towards the relatively reliable modality under the weak supervision of the unimodal predictions. This novel contrastive-based alignment method helps capture more trustworthy multimodal representations. Experiments on two public multimodal classification datasets demonstrate the effectiveness of our proposed method.

Limitations
Unlike traditional multimodal contrastive losses, which focus on building a direct link between paired modalities, our proposed UniS-MMC leverages inter-modality relationships and the potential effectiveness of each modality to create more trustworthy and complementary multimodal representations. This means that UniS-MMC does not apply to all multimodal problems. It can achieve competitive performance on tasks that rely on the quality of the joint representation, such as multimodal classification. It is not suitable for tasks that rely purely on the correspondence between modalities, such as cross-modal retrieval.
Figure 1: The unimodal representation of a single modality can be either effective or not. The effectiveness of different unimodal representations from the same sample also varies. To empower the interaction between modalities, our proposed method aligns the unimodal representations to the effective modality sample-wise and makes full use of the effective unimodal representation under the supervision of the unimodal prediction (T and F represent correct and incorrect predictions, respectively).

Figure 2: The framework of our proposed UniS-MMC.

Figure 3: The relationship comparison between two modalities in a training mini-batch for (a) unsupervised MMC, (b) supervised MMC and (c) UniS-MMC.

Figure 4: Unimodal representation distribution of the first 10 categories of the N24News test set across different methods: (a) aggregation-based method, (b) unsupervised multimodal contrastive method, (c) supervised contrastive method and (d) unimodality-supervised method.

Figure 5: Multimodal representation distribution of the first 10 categories of the N24News test set across different methods: (a) aggregation-based method, (b) unsupervised multimodal contrastive method, (c) supervised contrastive method and (d) unimodality-supervised method.

Figure 7: Consistency comparison of unimodal predictions between MT-MML and UniS-MMC.

Table 2: Comparison of multimodal classification performance on (a) Food101 and (b) N24News.