Adaptive Contrastive Learning on Multimodal Transformer for Review Helpfulness Prediction

Modern Review Helpfulness Prediction systems depend on multiple modalities, typically text and images. Unfortunately, contemporary approaches pay little attention to polishing representations of cross-modal relations and tend to suffer from inferior optimization, which can harm a model's predictions in numerous cases. To overcome these issues, we propose Multimodal Contrastive Learning for the Multimodal Review Helpfulness Prediction (MRHP) problem, concentrating on the mutual information between input modalities to explicitly model cross-modal relations. In addition, we introduce an Adaptive Weighting scheme for our contrastive learning approach in order to increase flexibility in optimization. Lastly, we propose a Multimodal Interaction module to address the unaligned nature of multimodal data, thereby helping the model produce more reasonable multimodal representations. Experimental results show that our method outperforms prior baselines and achieves state-of-the-art results on two publicly available benchmark datasets for the MRHP problem.


Introduction
Current e-commerce sites such as Amazon and eBay construct review platforms to collect user feedback concerning their products. These platforms play a fundamental role in online transactions, since they help future consumers gather useful reviews that assist them in deciding whether or not to make a purchase. Unfortunately, the number of user-generated reviews is now overwhelming, raising doubts about the relevance and veracity of reviews. Therefore, there is a need to verify the quality of reviews before publishing them to prospective customers. This has inspired a recent surge of interest in the Review Helpfulness Prediction (RHP) problem.

[Table 1: Example product ("Cooks Standard 6-Quart Stainless Steel Stockpot with Lid") with its description and two user reviews, illustrating cases where MCR's helpfulness predictions are unreasonable.]

Two principal groups of early efforts focus on purely textual data. The first group follows feature engineering techniques, retrieving argument-based features (Liu et al., 2017), lexical features (Krishnamoorthy, 2015), and semantic features (Kim et al., 2006) as input to their classifier. Inherently, these methods are labor-intensive and vulnerable to the typical issues of conventional machine learning. Instead of relying on manual features, the second group leverages deep neural models, for instance RNNs (Alsmadi et al., 2020) and CNNs (Chen et al., 2018), to learn rich features automatically. Nonetheless, this approach remains limited because the helpfulness of a review is contingent not only upon textual information but also upon other modalities.
To cope with the above issues, recent works (Liu et al., 2021b; Han et al., 2022) proposed to utilize multimodality via the Multi-perspective Coherent Reasoning (MCR) model. Hypothesizing that a review is helpful if its text and images cohere with the product information, those works take into account both the textual and visual modalities of the inputs, then estimate their coherence level to discern whether the reviews are helpful or unhelpful. However, the MCR model has a detrimental drawback. In particular, it aims to maximize the scores s_p of positive (helpful) product-review pairs while minimizing the scores s_n of negative (unhelpful) pairs. The assumption is that training in this manner projects features with similar semantics close together and features with disparate semantics far apart. Unfortunately, in multimodal learning this was shown not to be the case, causing the model to learn ad-hoc representations (Zolfaghari et al., 2021). This is one reason for the unreasonable predictions of MCR in Table 1. As can be seen, even though Review 1 closely relates to the "6-Quart Stainless Steel Stockpot" product, the model classifies it as unhelpful. In addition, the target of Review 2's text is vague because it does not specifically correspond to the "Stockpot"; in fact, it could apply to any product. Moreover, its image does not clearly show the "Stockpot" either. Despite such vagueness, MCR still predicts Review 2 as helpful.
As a remedy to this problem, we propose Cross-modal Contrastive Learning to mine the mutual information of cross-modal relations in the input and capture more sensible representations. Nonetheless, plainly applying a symmetric gradient pattern, which, similar to MCR, assigns equivalent penalties to s_n and s_p, is inflexible. In cases where s_p is small and s_n is already negatively skewed, or where both s_p and s_n are positively skewed, it is irrational to assign equivalent penalties to both. Last but not least, MCR directly leverages Coherent Reasoning, repeatedly enforcing alignment among the input modalities. This ignores the unaligned nature of multimodal input: images might refer only to a particular section of the text and hence do not completely align with the textual content. In consequence, strictly enforcing alignment can make the model learn inefficient multimodal representations (Tsai et al., 2019).
To overcome the above problems, we propose an adaptive scheme to accomplish the flexibility in the optimization of our contrastive learning stage. Finally, we propose to adopt a multimodal attention module that reinforces one modality's high-level features with low-level ones of other modalities. This not only relaxes the alignment assumption but also informs one modality of information of others, encouraging refined representation learning.
In sum, our contributions are three-fold:
• We propose Adaptive Cross-modal Contrastive Learning for the Review Helpfulness Prediction task, which polishes cross-modal relation representations.
• We propose a Multimodal Interaction module which correlates modalities' features without depending upon the alignment assumption.
• We conducted extensive experiments on two datasets for the RHP problem and found that our method outperforms both text-only and multimodal baselines, obtaining state-of-the-art results on those benchmarks.

Model Architecture
In this section we delineate the overall architecture of our MRHP model. Particular modules of our system are depicted in Figure 1.

Problem Definition
Given a product item p, which consists of a description T_p and images I_p, and a set of reviews R = {r_1, . . . , r_N}, where each review is composed of user-generated text T_{r_i} and images I_{r_i}, the RHP model's task is to generate the helpfulness scores

s_i = f(p, r_i), i = 1, . . . , N,

where N is the number of reviews for product p and f is the scoring function of the RHP model. Empirically, each score estimated by f indicates the helpfulness level of each review, and the ground truth is the descending sort order of helpfulness scores.
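As a toy illustration of this setup, the sketch below scores each review with a stand-in for f and ranks reviews by descending score. The overlap-based `score_review` is purely hypothetical and is not the paper's model; it only makes the scoring-and-ranking interface concrete.

```python
# Minimal sketch of the RHP scoring interface: score each review, then rank.
# `score_review` is a hypothetical placeholder for the learned function f.

def score_review(product_text: str, review_text: str) -> float:
    """Toy stand-in for f(p, r_i): fraction of review tokens in the product text."""
    product_tokens = set(product_text.lower().split())
    review_tokens = review_text.lower().split()
    if not review_tokens:
        return 0.0
    return sum(t in product_tokens for t in review_tokens) / len(review_tokens)

def rank_reviews(product_text: str, reviews: list[str]) -> list[int]:
    """Return review indices sorted by descending helpfulness score,
    mirroring the ground-truth descending sort order."""
    scores = [score_review(product_text, r) for r in reviews]
    return sorted(range(len(reviews)), key=lambda i: scores[i], reverse=True)
```

In the real model, f is the full multimodal network described below; only the interface (one scalar per product-review pair, sorted descending) is shared with this sketch.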

Encoding Modules
Our model accepts product description T p , product images I p , review text T r i , and review images I r i as input. The encoding process of those elements is described as follows.
Text Encoding Product description and review text are sequences of words. Each sequence is indexed into the word embedding layer and then passed into the respective LSTM layer for the product or the review:

K_p = LSTM_p(Emb(T_p)), K_r = LSTM_r(Emb(T_r)),

where K_p ∈ R^{l_p×d}, K_r ∈ R^{l_r×d}, l_p and l_r are the sequence lengths of the product and review text respectively, and d is the hidden size.

Image Encoding We follow Anderson et al. (2018) in taking detected objects as embeddings of the image. In particular, a pre-trained Faster R-CNN is applied to extract ROI features for m objects {a_1, a_2, . . . , a_m} from the product and review images. Subsequently, we encode the extracted features using a self-attention module (SelfAttn) (Vaswani et al., 2017):

A = SelfAttn({a_1, a_2, . . . , a_m}),

where A ∈ R^{m×d} and d is the hidden size. Here we use A_p and A_r to indicate product and review image features, respectively.
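For illustration, a minimal single-head version of the scaled dot-product self-attention used to encode ROI features can be sketched in NumPy as follows. Shapes and random weights are illustrative stand-ins, not the trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (Vaswani et al., 2017).
    X: (m, d_in) ROI features; returns (m, d) encoded features A."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (m, m) pairwise attention logits
    return softmax(scores, axis=-1) @ V      # (m, d)

# Toy shapes: m = 4 detected objects with 2048-dim ROI features, hidden size d = 128.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2048))
Wq, Wk, Wv = (rng.normal(size=(2048, 128), scale=0.02) for _ in range(3))
A = self_attn(X, Wq, Wk, Wv)
assert A.shape == (4, 128)
```

The full module is multi-headed and includes the usual feed-forward and normalization layers; this sketch isolates only the attention computation.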

Multimodal Interaction Module
We consider two components γ and η with inputs X_γ and X_η, where X_η is the concatenation of the input elements apart from the one in γ. For instance, if γ is the review text, then X_η = [X_{p_txt}; X_{p_img}; X_{r_img}], where [·;·] indicates the concatenation operation. We define each cross-modal attention block to have three components Q, K, and V:

Q_γ = X_γ W_{Q_γ}, K_η = X_η W_{K_η}, V_η = X_η W_{V_η},

where W_{Q_γ} ∈ R^{d_γ×d_k}, W_{K_η} ∈ R^{d_η×d_k}, and W_{V_η} ∈ R^{d_η×d_v} are weight matrices. The interaction between γ and η is computed in the cross-attention manner:

CM_{γ←η}(X_γ, X_η) = softmax(Q_γ K_η^T / √d_k) V_η.

Our full module comprises D layers of the above attention block, as indicated in the right part of Figure 1. At layer i, the computation is carried out as

Z_γ^{[i]} = CM_{γ←η}(LN(Z_γ^{[i-1]}), LN(X_η)) + LN(Z_γ^{[i-1]}),

where LN denotes the layer normalization operator. We iteratively estimate cross-modal features for product text, product images, review text, and review images, thereby obtaining H_p, V_p, H_r, and V_r.
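The cross-modal attention block can be sketched as follows. This single-head NumPy version is a simplification (no layer normalization, residual connections, or multiple heads) under the assumption of standard scaled dot-product attention; all shapes and weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attn(X_gamma, X_eta, Wq, Wk, Wv):
    """One cross-modal attention block: queries come from the target modality
    X_gamma, while keys/values come from X_eta, the concatenation of the
    remaining modalities' features."""
    Q = X_gamma @ Wq  # (l_gamma, d_k)
    K = X_eta @ Wk    # (l_eta, d_k)
    V = X_eta @ Wv    # (l_eta, d_v)
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return attn @ V   # (l_gamma, d_v): gamma features informed by eta

# Toy example: review text (length 6, dim 128) attends over the concatenation
# of product text, product image, and review image features (total length 10).
rng = np.random.default_rng(0)
X_gamma = rng.normal(size=(6, 128))
X_eta = np.concatenate([rng.normal(size=(4, 128)),
                        rng.normal(size=(3, 128)),
                        rng.normal(size=(3, 128))], axis=0)
Wq, Wk, Wv = (rng.normal(size=(128, 64), scale=0.05) for _ in range(3))
H = cross_modal_attn(X_gamma, X_eta, Wq, Wk, Wv)
assert H.shape == (6, 64)
```

Because keys and values come from the other modalities while queries stay within the target modality, no token-level alignment between modalities is ever assumed, which is the point of the module.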
After our cross-modal interaction module, we fuse the resulting features along three paths: intra-modal, inter-modal, and intra-review.

Intra-modal Fusion The intra-modal alignment is calculated for two kinds of relations: (1) product text - review text and (2) product image - review image. First, we learn the alignment among intra-modal features via self-attention modules. The intra-modal hidden representations are then fed to a CNN, followed by a max-pooling layer, to attain the salient entries.

Inter-modal Fusion Similar to intra-modal alignment, inter-modal alignment is calculated for two types of relations as well: (1) product text - review image and (2) product image - review text. The first step is again to relate feature components using self-attention modules. We then adopt a mean-pooling layer to aggregate the inter-modal features and concatenate the pooled vectors to construct the final inter-modal representation, e.g.,

I_{prd_txt-rvw_img} = MeanPool(H_{prd_txt-rvw_img}).

Intra-review Fusion The intra-review estimation completely mimics the inter-modal one. The only difference is that it is taken over two different relations: (1) product text - product image and (2) review text - review image.
Finally, we concatenate the intra-modal, inter-modal, and intra-review outputs, and feed the concatenated vector to a linear layer to obtain the ranking score f(p, r_i).

Training Strategies

Adaptive Cross-modal Contrastive Learning
In this section, we explain the formulation of our Cross-modal Contrastive Learning, its adaptive pattern, and the accompanying derivation.

Cross-modal Contrastive Learning First, we extract the hidden states of helpful product-review pairs. Then, the hidden features are max-pooled to extract meaningful entries:

h_p = MaxPool(H_p), h_r = MaxPool(H_r), v_p = MaxPool(V_p), v_r = MaxPool(V_r).
We formulate our contrastive learning framework taking positive and negative pairs from the abovementioned cross-modal features. In our framework, we hypothesize that pairs established by modalities of the same sample are positive, whereas those formed by modalities of distinct ones are negative.
L_cl = -Σ_{i=1}^{B} sim(t^1_i, t^2_i) + Σ_{j=1}^{B} Σ_{k≠j} sim(t^1_j, t^2_k),

where t^1, t^2 ∈ {h_p, h_r, v_p, v_r}, and B denotes the batch size in the training process.

Adaptive Weighting The standard contrastive objective above suffers from inflexible optimization due to irrational gradient assignment to positive and negative pairs. To tackle this problem, we propose the Adaptive Weighting strategy for our contrastive framework. Initially, we introduce weights ε_p and ε_n to represent distances from the optimum, then integrate them into the positive and negative terms of our loss:

L_cl = -Σ_{i=1}^{B} ε_p^i · sim(t^1_i, t^2_i) + Σ_{j=1}^{B} Σ_{k≠j} ε_n^{j,k} · sim(t^1_j, t^2_k),

where

ε_p^i = [o_p - sim(t^1_i, t^2_i)]_+ , ε_n^{j,k} = [sim(t^1_j, t^2_k) - o_n]_+ ,

and [x]_+ = max(x, 0). Investigating the intuition behind the values of o_p and o_n, we derive the following theorem.

Theorem 1 The Adaptive Contrastive Loss has the hyperspherical form

L_cl = Σ_{i=1}^{B} (sim(t^1_i, t^2_i) - o_p/2)^2 + Σ_{j=1}^{B} Σ_{k≠j} (sim(t^1_j, t^2_k) - o_n/2)^2 - const,

where each constant term C = (o_p/2)^2 + (o_n/2)^2 > 0.

We provide the proof of Theorem 1 in the Appendix. Consequently, the contrastive objective theoretically reaches its optimum when sim(t^1_i, t^2_i) = o_p/2 and sim(t^1_j, t^2_k) = o_n/2. Based on this observation, in our experiments we set o_p = 2 and o_n = 0.
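A minimal NumPy sketch of this adaptive weighting follows, under the assumption that the loss is linear in the cosine similarities with hinge-clipped weights ε_p = [o_p - s_p]_+ and ε_n = [s_n - o_n]_+ (the exact loss form here is a reconstruction, not verbatim from the paper).

```python
import numpy as np

def adaptive_contrastive_loss(T1, T2, o_p=2.0, o_n=0.0):
    """Sketch of the adaptive contrastive objective. T1, T2: (B, d)
    L2-normalized features of two modalities; row i of T1 and row i of T2
    form a positive pair, all cross-sample pairs are negatives."""
    S = T1 @ T2.T                        # (B, B) cosine similarities
    B = S.shape[0]
    pos = np.diag(S)                     # s_p for each sample
    eps_p = np.maximum(o_p - pos, 0.0)   # larger penalty when s_p is far below o_p
    mask = ~np.eye(B, dtype=bool)
    neg = S[mask]                        # s_n for all cross-sample pairs
    eps_n = np.maximum(neg - o_n, 0.0)   # larger penalty when s_n exceeds o_n
    return float(-(eps_p * pos).sum() + (eps_n * neg).sum())

rng = np.random.default_rng(0)
T1 = rng.normal(size=(8, 16)); T1 /= np.linalg.norm(T1, axis=1, keepdims=True)
T2 = rng.normal(size=(8, 16)); T2 /= np.linalg.norm(T2, axis=1, keepdims=True)
loss = adaptive_contrastive_loss(T1, T2)
# Perfectly aligned positives (T2 = T1) give a lower loss than random pairing.
assert adaptive_contrastive_loss(T1, T1.copy()) < loss
```

Note how the weights realize the adaptive behavior: a positive pair whose similarity already reaches o_p receives zero gradient, while a badly placed pair is penalized proportionally to its distance from the optimum.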

Training Objective
For the Review Helpfulness Prediction problem, the model's parameters are updated according to the pairwise ranking loss

L_rank = Σ max(0, α - f(p, r^+) + f(p, r^-)),

where α is the margin, and r^+ and r^- are random reviews such that r^+ possesses a higher helpfulness level than r^-. We jointly combine the contrastive objective with the ranking objective of the Review Helpfulness Prediction problem to train our model:

L = L_rank + L_cl.

Datasets
We evaluate our methods on two publicly available benchmark datasets for the MRHP task: Lazada-MRHP and Amazon-MRHP. We present their statistics in Table 2.

Implementation Details
We use a 1-layer LSTM with a hidden dimension size of 128. We initialize our word embeddings with fastText embeddings (Bojanowski et al., 2017) for the Lazada-MRHP dataset and with 300-dimensional GloVe pretrained word vectors (Pennington et al., 2014) for the Amazon-MRHP dataset. We set our multimodal attention module to have D = 5 attention layers. For the visual modality, we extract 2048-dimensional ROI features from each image and encode them into 128-dimensional vectors. Our entire model is trained end-to-end with the Adam optimizer (Kingma and Ba, 2014) and a batch size of 32. For the training objective, we set the margin in the ranking loss to 1.

Baselines
We compare our proposed architecture against the following baselines: • BiMPM (Wang et al., 2017): a ranking model which encodes input sentences in two directions to ascertain the matching result.
• Conv-KNRM (Dai et al., 2018): a CNNbased model which encodes n-gram of multiple lengths and uses kernel pooling to generate the final ranking score.
• EG-CNN (Chen et al., 2018): a CNN-based model targeting data scarcity and OOV problem in RHP task via taking advantage of character-based representations and domain discriminators.
• PRH-Net (Fan et al., 2019): a baseline to predict helpfulness of a review by taking into consideration both product text and product metadata.
• DR-Net (Xu et al., 2020): a cross-modality approach that models contrast in associated contexts by leveraging decomposition and relation modules.
For the Amazon dataset, which is in English, our model outperforms MCR on all three categories: by 1.4 NDCG@5 points in Clothing, 2.7 points in Electronics, and 1.2 points in Home. These results verify that our interaction module and optimization approach produce more useful multimodal fusion than the previous state-of-the-art baselines, not only in English contexts but in other languages as well.
We also perform significance tests to evaluate the statistical significance of our improvement on two datasets Amazon-MRHP and Lazada-MRHP, and note p-values in Table 5. As shown in the table, all of the p-values are smaller than 0.05, verifying the statistical significance in the enhancement of our method against prior best MRHP model, MCR (Liu et al., 2021b).

Case Study
In Table 1, we introduce an example of one product item and two reviews extracted from Electronics category of Amazon-MRHP dataset. Whereas MCR fails to predict relevant helpfulness scores, our model successfully produces sensible rankings for both of them. We hypothesize that our Multimodal Interaction module learns more meaningful representations and Adaptive Contrastive Learning framework acquires more logical hidden states of relations among input elements. Thus, our model is able to generate more rational outcomes.

Ablation Study
In this section, we proceed to study the impact of (1) Adaptive Contrastive Learning framework and (2) Cross-modal Interaction module.
Adaptive Contrastive Learning It is worth noting from Table 6 that plainly integrating contrastive learning brings a smaller enhancement to performance, with the improvement in NDCG@3 dropping 0.53 points on the Lazada-MRHP dataset and NDCG@5 waning 0.84 points on the Amazon-MRHP dataset. Furthermore, completely removing the contrastive objective hurts performance, with the NDCG@3 score decreasing 0.77 points on Lazada-MRHP and the MAP score declining 1.06 points on Amazon-MRHP. We hypothesize that the model loses the ability to learn efficient representations for cross-modal relations.

Cross-modal Interaction In this ablation, we eliminate the cross-modal interaction module. As shown in Table 6, without the module the improvement is downgraded: for instance, NDCG@3 drops 1.89 points on Lazada-MRHP and MAP shrinks 1.39 points on Amazon-MRHP. We hypothesize that without the module, the model depends rigidly on alignment among the multimodal input elements, which results in insensible modeling, because in most cases cross-modal elements cannot be bijectively mapped together.

To further analyze representation quality, we measure distances among input samples using standard distance functions; Tables 7 and 8 report the intra-modal, inter-modal, and intra-review distances in the Home category of the Amazon-MRHP dataset. In particular, we estimate the cosine distance (CS) and the L2 distance (L2) between tokens of (1) product text - review text and product image - review image (intra-modal), (2) product text - review image and product image - review text (inter-modal), and (3) product text - product image and review text - review image (intra-review), then average over all samples. As can be seen, our framework is more efficient at attracting elements of helpful pairs and repelling those of unhelpful pairs.
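The distance measurements above use standard functions; a minimal sketch of the cosine and L2 variants over matched token features (shapes are illustrative) is:

```python
import numpy as np

def mean_cosine_distance(X, Y):
    """Mean cosine distance (1 - cosine similarity) between matched token
    features X[i] and Y[i] of two modalities."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return float((1.0 - (Xn * Yn).sum(axis=1)).mean())

def mean_l2_distance(X, Y):
    """Mean Euclidean (L2) distance between matched token features."""
    return float(np.linalg.norm(X - Y, axis=1).mean())

X = np.array([[1.0, 0.0], [0.0, 2.0]])
assert mean_cosine_distance(X, X) < 1e-9                       # identical: 0
assert abs(mean_l2_distance(X, X + 1.0) - np.sqrt(2.0)) < 1e-9  # all-ones diff
```

Lower distances for helpful pairs and higher distances for unhelpful pairs are the desired pattern in Tables 7 and 8.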

Review Helpfulness Prediction
Past works on the Review Helpfulness Prediction (RHP) problem follow text-only approaches. In general, they extract salient information, for instance lexical (Krishnamoorthy, 2015), argument (Liu et al., 2017), and emotional features (Martin and Pu, 2014), from reviews. These features are then fed to a standard classifier such as Random Forest (Louppe, 2014) to produce the output score. Motivated by the rapid development of computational resources, contemporary approaches take advantage of deep learning techniques to tackle the RHP problem. For instance, prior work proposes multi-perspective matching between review and product information by applying attention mechanisms. Furthermore, Chen et al. (2018) and Dai et al. (2018) adapt CNN models to learn textual representations from various views.
In reality, review content is determined not only by text but also by other modalities. As a consequence, Fan et al. (2019) integrate metadata of the target product into the prediction model, and Abavisani et al. (2020) filter out uninformative signals before fusing the modalities. Moreover, Liu et al. (2021b) perform coherent reasoning to ascertain the matching level between a product and numerous review items.

Contrastive Estimation
Different from architectural techniques such as Knowledge Distillation (Hinton et al., 2015; Hahn and Choi, 2019; Nguyen and Luu, 2022) or Variational AutoEncoders (Zhao et al., 2020; Wang et al., 2019), Contrastive Learning has been introduced as a representation-based yet universal mechanism for enhancing natural language processing performance. Proposed by Chopra et al. (2005), Contrastive Learning has been widely adopted across myriad problems of Natural Language Processing (NLP).
As an approach to polish text representations, Gao et al. (2021), Zhang et al. (2021), Liu et al. (2021a), and Nguyen and Luu (2021) employ contrastive losses to advance sentence embeddings and topic representations. For downstream tasks, Cao and Wang (2021) propose negative sampling strategies to generate noisy output so that the model can learn to distinguish correct summaries from incorrect ones in Document Summarization. For Spoken Question Answering (SQA), You et al. (2021) introduce augmentation algorithms in their contrastive learning stage so as to capture noise-invariant representations of utterances. Additionally, Ke et al. (2021) inherit the formulation of the contrastive objective to construct a distillation loss which transfers knowledge from the previous task to the current one, improving tasks in the Aspect Sentiment Classification domain. Unfortunately, despite the surge of interest in applying contrastive learning to NLP, research adapting the method to the MRHP task has been scant.

Conclusion
In this paper, we propose methods to polish representation learning for the Multimodal Review Helpfulness Prediction task. In particular, we advance cross-modal relation representations by learning mutual information through contrastive learning. To further enhance our framework, we propose an adaptive weighting strategy that encourages flexibility in optimization. Moreover, we integrate a cross-modal interaction module to loosen the model's reliance on the alignment assumption among modalities, further refining the multimodal representations. Our framework outperforms prior baselines and achieves state-of-the-art results on the MRHP problem.

Limitations
Despite the novelty and benefits of our method for the Multimodal Review Helpfulness Prediction (MRHP) problem, it does have some drawbacks. Firstly, even though empirical results demonstrate that our approach works in contexts beyond English, we have not verified it in multilingual circumstances, in which product or review texts are written in different languages. If a model is corroborated to work efficaciously in such contexts, it can provide myriad benefits for practical deployment; for example, e-commerce applications could leverage one single model for multiple cross-lingual scenarios. Furthermore, our work can also be extended to other domains. For instance, in movie assessment, we would need to determine whether a review suits the material of the film, or whether the visual scenes in a comment are consistent with its textual content. These would form prospective future directions.
Secondly, in the MRHP problem, there are several relationships that contrastive learning could exploit to further improve performance. In particular, performing contrastive discrimination between two sets of reviews can furnish the model with useful set-based representations, which consolidate general knowledge for better helpfulness prediction. Similar insights apply to two sets of product information. For now, we leave these promising perspectives for future work.

A Hyperspherical Form of Adaptive Contrastive Loss
We start from the formulation of the adaptive contrastive loss:

L_cl = -Σ_{i=1}^{B} ε_p^i · sim(t^1_i, t^2_i) + Σ_{j=1}^{B} Σ_{k≠j} ε_n^{j,k} · sim(t^1_j, t^2_k).

We substitute ε_p^i = [o_p - sim(t^1_i, t^2_i)]_+ and ε_n^{j,k} = [sim(t^1_j, t^2_k) - o_n]_+ into the above equation and complete the square in each term, giving

L_cl = Σ_{i=1}^{B} (sim(t^1_i, t^2_i) - o_p/2)^2 + Σ_{j=1}^{B} Σ_{k≠j} (sim(t^1_j, t^2_k) - o_n/2)^2 - const,

where each constant term C = (o_p/2)^2 + (o_n/2)^2. We thus obtain the hyperspherical form of our contrastive loss.
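The completing-the-square step can be written out explicitly. This is a hedged reconstruction under the assumption that the hinges [·]_+ are active (i.e., the clipped quantities are nonnegative), writing s^p_i = sim(t^1_i, t^2_i) and s^n_{j,k} = sim(t^1_j, t^2_k) for brevity:

```latex
% Hedged reconstruction of the completing-the-square step, assuming the
% hinges [\cdot]_+ are active so that the weights reduce to their arguments.
\begin{aligned}
\mathcal{L}_{cl}
&= \sum_{i} -\epsilon^p_i\, s^p_i
   + \sum_{j}\sum_{k \neq j} \epsilon^n_{j,k}\, s^n_{j,k} \\
&= \sum_{i} -\big(o_p - s^p_i\big)\, s^p_i
   + \sum_{j}\sum_{k \neq j} \big(s^n_{j,k} - o_n\big)\, s^n_{j,k} \\
&= \sum_{i} \Big[\big(s^p_i\big)^2 - o_p\, s^p_i\Big]
   + \sum_{j}\sum_{k \neq j} \Big[\big(s^n_{j,k}\big)^2 - o_n\, s^n_{j,k}\Big] \\
&= \sum_{i} \Big(s^p_i - \tfrac{o_p}{2}\Big)^2
   + \sum_{j}\sum_{k \neq j} \Big(s^n_{j,k} - \tfrac{o_n}{2}\Big)^2
   - \mathrm{const},
\qquad C = \Big(\tfrac{o_p}{2}\Big)^2 + \Big(\tfrac{o_n}{2}\Big)^2 > 0
\end{aligned}
```

Each quadratic term is minimized exactly when the similarity equals half its target offset, which is what motivates the choice o_p = 2 and o_n = 0 for cosine similarities bounded by 1.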