A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Conditional inference on joint textual and visual clues is a multi-modal reasoning task in which textual clues provide prior presumptions or external knowledge that are complementary to the visual content and pivotal for deducing the correct option. Previous methods utilizing pretrained vision-language models (VLMs) have achieved impressive performance, yet they lack multi-modal context reasoning capability, especially for text-modal information. To address this issue, we propose a Multi-modal Context Reasoning approach, named ModCR. Whereas VLMs perform reasoning via cross-modal semantic alignment, ModCR regards the given abstract textual semantics and objective image information as pre-context information and embeds them into the language model to perform context reasoning. Different from recent vision-aided language models used in natural language processing, ModCR incorporates multi-view semantic alignment information between language and vision by introducing a learnable alignment prefix between image and text into the pretrained language model. This makes the language model well suited to such multi-modal reasoning scenarios on joint textual and visual clues. We conduct extensive experiments on two corresponding data sets, and the results show significantly improved performance (an absolute gain of 4.8% on the PMR test set) compared to previous strong baselines.


Introduction
Cross-modal reasoning is a hot research topic in both the natural language processing and computer vision communities. Most cross-modal reasoning tasks, such as Visual Question Answering (Antol et al., 2015; Wu et al., 2017; Shah et al., 2019; Yusuf et al., 2022), Visual Dialog (Zhang et al., 2022; Chen et al., 2022), Visual Entailment (Xie et al., 2019; Do et al., 2020), and Visual Commonsense Reasoning (Zellers et al., 2019a; Ye and Kovashka, 2021; Li et al., 2022a), concentrate on visual reasoning scenarios that rely primarily on image information. The given text (or question) is highly attached to the image and lacks prior presumptions, e.g., the common question "Why is person 4 pointing to person 1" shown in the VCR (Zellers et al., 2019a) data set. In another practical cross-modal reasoning scenario (Dong et al., 2022), the textual modality often provides prior presumptions or information complementary to the source image, such as commonsense knowledge, or the personalities, feelings, and relationships of persons, as the premise shown in Figure 1. In this paper, we focus on such conditional inference on joint textual and visual clues, where the specific task form is to select the correct option from a candidate set according to the given textual premise and image.

[Figure 1: A case from the PMR (Dong et al., 2022) data set, where the correct option is answer B. The blue-colored words represent the pivotal textual clue to infer the correctness of answers A and B.]
Previous methods (Chen et al., 2020; Krojer et al., 2022; Dong et al., 2022) usually input the concatenated sequence of textual premise, image, and candidate answer into powerful pretrained vision-language models (VLMs) and employ a task-specific classifier to infer the result with attention to the joint representation obtained from the VLMs. Although these methods work well for reasoning based mainly on visual clues, they suffer from one major shortcoming: the reasoning process does not fully utilize the abstract semantic information of the given premise text to perform in-context reasoning. In the case shown in Figure 1, pretrained VLMs know "person [1] sits on the couch, not the bed" from the image, yet struggle to infer that the person will "have a rest on the couch" according to "feels very tired" in the premise. This may be attributed to the fact that pretrained VLMs mostly map different modalities into a unified space (Long et al., 2022) and perform cross-modal semantic alignment and fusion; they neglect in-context learning based on the given multi-modal semantics of language and vision during pretraining, such as next sentence prediction. Fortunately, pretrained language models (PLMs) such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020), and GPT-3 (Brown et al., 2020) are highly capable of in-context learning and have achieved strong performance on natural language inference and open-ended text generation. Compared to pretrained VLMs, PLMs can infer the next-step intent according to the given abstract text information. Hence, we propose a simple and effective Multi-modal Context Reasoning approach named ModCR for this multi-modal reasoning task, taking advantage of both VLMs and PLMs.
Specifically, ModCR employs a pretrained visual encoder equipped with a vision mapping network to obtain the image representation and convert it into a learnable visual prefix. The visual prefix and textual premise are regarded as two types of pre-context, which are fed to the in-context reasoner, i.e., the language model, to infer the correctness of each answer. Considering the semantic gap between the visual prefix and text in the language model, we first utilize a multi-grained vision-language semantic alignmenter to obtain multi-view alignment representations between image and text. Afterwards, we devise an alignment mapping network to capture the pivotal alignment information and convert it into a learnable cross-modal alignment prefix. Finally, we feed the two prefixes, the premise, and the answer into the language model to perform cross-modal reasoning via an instruction-template-based slot-filling method. In this way, ModCR bridges the semantic gap between visual content and text in the language model by introducing the cross-modal alignment prefix, and it makes use of the abstract semantics of the premise and the objective image information via the self-attention mechanism in PLMs.
To verify the effectiveness of ModCR, we conduct extensive experiments on two cross modal reasoning data sets: PMR (Dong et al., 2022) and VCR (Zellers et al., 2019a). The experimental results show that the proposed method significantly outperforms previous strong baselines. The ablation and case studies indicate that ModCR is capable of in-context reasoning based on multi-modal information.
Our contributions can be summarised as follows:

• We propose a multi-modal context reasoning framework for conditional inference on joint textual and visual clues, utilizing the in-context learning capability of PLMs.
• To the best of our knowledge, we are the first to introduce the multi-view alignment information between vision and language into the language model to perform cross modal reasoning, bridging the semantic gap between vision and language in PLMs.
• Experimental results show that ModCR achieves state-of-the-art performance on two corresponding data sets. It significantly outperforms previous vision-aided language models and pretrained VLMs-based approaches.
Related Work

Vision-Language Models. Over the past few years, significant progress has been made in developing vision-language models, owing to the Transformer (Vaswani et al., 2017) architecture and large-scale multi-modal web data (Bugliarello et al., 2021; Lin et al., 2021). These pretrained VLMs can be divided into single-stream and double-stream (Radford et al., 2021; Lu et al., 2022a) types according to their multi-modal information interaction methods. Our work explores how to expand and ameliorate pretrained VLMs for conditional inference on joint textual and visual clues.
Vision-aided Language Models. Images can provide explicit and diverse visual information to improve the imaginative representation of language. Recent works show that vision-aided language models have achieved promising performance on natural language understanding (Lu et al., 2022b) and open-ended text generation tasks such as text completion (Zellers et al., 2019b), story generation (Fan et al., 2018), and concept-to-text (Barzilay and Lapata, 2005). Some works (Shi et al., 2019; Lu et al., 2022b) proposed to retrieve images corresponding to texts from an image corpus and use the visual knowledge to improve performance on downstream tasks. Recently, some researchers (Long et al., 2021) proposed to utilize powerful text-to-image techniques to obtain imagination representations of language and infuse them into the language model via prefix-tuning (Li and Liang, 2021). In this paper, we also compare with visual prefix-based prompt learning methods (Liang et al., 2022; Jin et al., 2022; Tsimpoukelli et al., 2021), which have been verified to improve the performance of pretrained language models.

Overview
ModCR focuses on infusing the given multi-modal information, namely premise, image, and answer, into the language model to make conditional inferences based on textual and visual clues. The overview of ModCR is illustrated in Figure 2. Specifically, given the premise P = (p_1, ..., p_M), image I, and answer candidates A = (a_1, ..., a_Y), where p_i indicates the i-th token of the premise and a_i the i-th answer in the candidate set, we first use the visual encoder to obtain the image representation, which is projected into the visual prefix to provide the objective environment information.
Considering the semantic gap between visual prefixes and text when the language model performs context learning, we devise an alignment mapping network based on a multi-grained vision-language semantic alignmenter to obtain the cross-modal alignment prefix. Finally, the two types of prefixes, the premise text, and the answer candidate are fed to the language model via instruction learning to perform multi-modal context reasoning.

Base Model
Previous methods (Dong et al., 2022; Chen et al., 2020; Yu et al., 2021a) adopt a pretrained vision-language model to obtain a joint representation of text and image during inference. Similarly, we utilize the pretrained single-stream bidirectional encoder Oscar as the backbone of the visual encoder and the multi-grained vision-language semantic alignmenter. In this case, image features are first extracted by the widely used Faster-RCNN (Ren et al., 2015) tool and fed into the visual encoder and alignmenter. Oscar mainly makes token-level semantic alignments between image and text. Hence, following Yang et al. (2022), we pretrain an Oscar-based chunk-aware semantic interactor on the Flickr30k Entities (Plummer et al., 2015) data set to perform phrase-level semantic alignment between text and image.

Mapping Networks
We denote the obtained sequence representations of the image and of the text aligned with the image features as H^I, H^ta, and H^pa, respectively, where h^I_i indicates the output hidden state of the i-th image region (obtained by Faster-RCNN), and h^ta_i or h^pa_i represents the token-level or phrase-level aligned representation of the i-th token in the answer text. N is the token length of the answer. Similarly, h^Ig, h^tag, and h^pag denote the global representations of the image, the token-level alignment information, and the phrase-level alignment information, respectively. However, the obtained visual and alignment embedding vectors may lie in a representation space different from that of the language model (used in the multi-modal context reasoner) due to the discrepancy across models. To alleviate this gap, we adopt feature mapping networks (Mokady et al., 2021) to project them into the corresponding learnable prefixes.

Vision Mapping Network (VMN). As the top blue part of Figure 2 shows, we use the visual encoder to encode the image and employ a vision mapping network to project the image representation H^I into the sequence of visual prefix vectors V = (v_1, ..., v_l) with fixed length l, where v_i represents the i-th visual prefix embedding. For VMN, we adopt a two-layer perceptron with a ReLU activation function:

V = W_2 ReLU(W_1 H^I + b_1) + b_2, (1)

where W_1, W_2, b_1, and b_2 are learnable parameters. VMN could be pretrained on large-scale image-text pairs to project visual features into a visual prefix that shares the space distribution of word embeddings in LMs.

Alignment Mapping Network (AMN). AMN captures the multi-view semantic alignment information of the image-text pair and converts it into the cross-modal alignment prefix. Such a prefix can bridge the semantic gap between the visual prefix and text in the language model, enhancing the interactive understanding of image-text information. Specifically, we first apply a two-layer transformer to capture the pivotal multi-view alignment information lying in H^ta and H^pa. The calculation of the first layer is:

H^1 = W_dr [cross(H^ta, H^pa), cross(H^pa, H^ta)] + b_dr, (2)

where W_dr and b_dr are learnable parameters.
cross represents the cross-attention calculation and [, ] denotes concatenation. After applying the same calculation in the second layer, we obtain the pivotal alignment representation h^ag. Second, we project it into the cross-modal alignment prefix via a calculation similar to that of the vision mapping network (Eq. 1). Finally, we obtain the alignment prefix representation A = (a_1, ..., a_m), where a_i indicates the i-th alignment embedding and m is the length of the prefix. In this way, AMN captures the pivotal semantic alignment information and projects it into learnable prefix vectors in the word embedding space.
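To make the two mapping steps concrete, here is a minimal pure-Python sketch (toy dimensions and random initialization are ours, not the trained parameters of the model): a single-head cross-attention fusing the token-level and phrase-level alignment states, and a two-layer ReLU perceptron of the kind used by VMN (and by the final projection in AMN) to map a feature vector to l prefix vectors.

```python
import math
import random

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention; keys double as
    values here (a simplification of the paper's two-layer transformer)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]     # numerically stable softmax
        z = sum(exps)
        out.append([sum(e / z * k[j] for e, k in zip(exps, keys_values))
                    for j in range(d)])
    return out

def mapping_network(feature, l=5, d=4, hidden=8, seed=0):
    """Two-layer perceptron with ReLU that projects one feature vector
    into l prefix vectors of dimension d (an l*d output, reshaped)."""
    rng = random.Random(seed)
    d_in = len(feature)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(l * d)]
    h = [max(0.0, sum(w * x for w, x in zip(row, feature))) for row in w1]  # layer 1 + ReLU
    out = [sum(w * x for w, x in zip(row, h)) for row in w2]                # layer 2
    return [out[i * d:(i + 1) * d] for i in range(l)]                       # reshape to l prefixes

# Token-level states attending over phrase-level states, then projected.
H_ta = [[1.0, 0.0], [0.0, 1.0]]
H_pa = [[0.5, 0.5], [1.0, -1.0]]
fused = cross_attention(H_ta, H_pa)
prefix = mapping_network([x for row in fused for x in row])
```

In practice both networks are learned jointly with the reasoner; this sketch only illustrates the data flow from alignment states to prefix vectors.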

Multi-Modal Context Reasoner
After obtaining the two types of prefixes, we infuse them into a context reasoner to conduct cross-modal reasoning, where we adopt the pretrained language model RoBERTa (Liu et al., 2019) as the context reasoner. We utilize the widely used instruction-learning method to incorporate the whole context encoding information. Specifically, we fill the visual prefix, alignment prefix, premise, and answer candidate into a pre-defined instruction template: "<cls> Is Answer correct or wrong based on conditions? <sep> Conditions: The Image is <V>, Bridge between the following text and image is <A>, Premise Text is <Premise Text> <sep> Answer is <Answer candidate>.". The special symbols <V>, <A>, <Premise Text>, and <Answer candidate> are replaced, in turn, by the obtained prefix vectors V and A and the word embedding representations of the premise and answer. The sequence representation is fed into the context reasoner to infer the final result. In this way, we can utilize the context learning capability of the pretrained language model to tackle the multi-modal reasoning problem. We obtain the inference result for each answer candidate by applying a two-layer perceptron with a ReLU activation function to the output hidden state h_cls of the top layer in RoBERTa. The whole training objective of ModCR is defined as

L_2 = - Σ_i q_i log(x_i), (3)

where x_i is the output probability of the i-th answer candidate and q is the golden label (q_i = 1 for the correct candidate and 0 otherwise).
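As an illustration, the instruction slot-filling step can be sketched in a few lines (the numbered placeholder tokens and helper name are our own; in the model the prefixes are embedding vectors spliced into the input sequence, not text tokens):

```python
def build_instruction(premise: str, answer: str, lv: int = 5, la: int = 5) -> str:
    """Fill the pre-defined instruction template. <V_i>/<A_i> mark the
    positions later replaced by the visual and alignment prefix vectors."""
    visual_slots = " ".join(f"<V_{i}>" for i in range(lv))
    align_slots = " ".join(f"<A_{i}>" for i in range(la))
    return (
        "<cls> Is Answer correct or wrong based on conditions? <sep> "
        f"Conditions: The Image is {visual_slots}, "
        f"Bridge between the following text and image is {align_slots}, "
        f"Premise Text is {premise} <sep> "
        f"Answer is {answer}."
    )

seq = build_instruction("[person1] feels very tired.",
                        "[person1] will have a rest on the couch.")
```

Each answer candidate is scored with its own filled template, so one sample with Y candidates yields Y forward passes through the reasoner.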

Training and Inference
To make Eq. 2 in the alignment mapping network capture the pivotal multi-view alignment information, we first train it for about one epoch to alleviate the cold-start problem that would lead to the collapse of the network. Concretely, we use a linear function to project h^ag into a confidence score and employ the cross-entropy loss to optimize it locally with the golden label q. This training process is denoted L_1. Thus, the whole training process is defined as

L = L_1 if steps < N_whole, and L = L_2 otherwise, (4)

where steps denotes the optimization step during training and N_whole represents the start of the whole training. For inference, we input each answer candidate with the premise and image into ModCR to obtain a confidence score and adopt the maximum one as the final result.
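The two-stage schedule and the inference rule can be sketched as follows (a minimal sketch under our reading of the schedule: the local alignment loss L1 warms up the network before the whole-model objective L2 takes over; the exact combination in the released code may differ):

```python
def training_loss(step, n_whole, loss_align, loss_main):
    """Two-stage schedule: use the local alignment loss L1 for the first
    n_whole optimization steps, then switch to the whole-model loss L2."""
    return loss_align if step < n_whole else loss_main

def infer(scores):
    """Pick the index of the answer candidate with the maximum confidence."""
    return max(range(len(scores)), key=lambda i: scores[i])

best = infer([0.12, 0.81, 0.34, 0.05])  # candidate 1 has the highest score
```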

Data sets
Conditional inference on joint textual and visual clues is a task in which the text provides prior presumptions or information (external knowledge) complementary to the image. Few data sets in the community meet this requirement. To verify the effectiveness of the proposed model, we first adopt the high-quality, human-constructed PMR (Dong et al., 2022) data set, which contains 12,080 training samples, 1,538 validation samples, and 1,742 testing samples. Its textual premises pass human cross-check annotation and cover six categories, including relationship, personality, and mood. In addition, we reorganized a corresponding large-scale data set from the VCR data set (Zellers et al., 2019a). We combine the given correct rationale and question as the textual premise and reform the original task into inferring the answer based on the new premise and image, i.e., QR→A. In this way, the rationale can provide external knowledge beyond the source image. We use the original validation set as the test set and select some training samples as the validation set. Finally, the samples are divided into 210k training / 2,923 validation / 26,534 testing.
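The QR→A reorganization amounts to a small transformation per sample, sketched below (the field names are illustrative, not the released VCR schema):

```python
def reorganize_vcr(sample):
    """QR->A: concatenate the question and the correct rationale into a new
    textual premise; the task becomes choosing the answer candidate given
    this premise and the image."""
    premise = sample["question"].rstrip("?") + "? " + sample["rationale"]
    return {"premise": premise,
            "image": sample["image"],
            "answers": sample["answers"],
            "label": sample["label"]}

ex = reorganize_vcr({
    "question": "Why is person4 pointing at person1?",
    "rationale": "Person4 is telling person3 that person1 ordered the pancakes.",
    "image": "img_0001.jpg",
    "answers": ["He just told a joke.",
                "He is identifying who ordered the food."],
    "label": 1,
})
```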

Baselines
We compare the proposed method to pretrained LMs and VLMs as follows: BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are both transformer-based pretrained language models that have achieved impressive performance on many natural language understanding tasks. We fine-tune them with access only to the textual premise.
VL-BERT (Lu et al., 2019) is a dual-stream pretrained cross-modal model. It adopts the BERT architecture, and visual features are concatenated with text embeddings.
ERNIE-VL (Yu et al., 2021a) is a single-stream fusion encoder. It utilizes the structured knowledge obtained from scene graphs to learn joint representations of vision and language.
UNITER (Chen et al., 2020) also expands the BERT architecture to incorporate visual information and power heterogeneous downstream visionlanguage tasks with joint multi-modal embeddings.
Oscar is also a single-stream fusion encoder that uses object tags detected in images as anchor points to significantly ease the learning of alignments.
OFA is a sequence-to-sequence cross-modal learning framework that unifies a diverse set of cross-modal and unimodal tasks, including visual grounding, image captioning, image classification, language modelling, etc.

[Table 1: Results on the PMR data set (Dong et al., 2022). For baselines, "-B" and "-L" indicate the base and large version, respectively. Underscore and bold indicate the second-highest value and best performance (same as in the following tables). "frozen VLMs" and "fine-tune VLMs" indicate whether the parameters of the visual encoder and multi-grained vision-language alignmenter are involved in training.]

MVPTR (Li et al., 2022b) is a pretrained cross-modal model that introduces multi-level semantic alignment of vision and language to facilitate representation learning synergistically.
CALeC is a unified prediction and generation model for several vision-language tasks; it introduces a chunk-aware semantic interactor to improve the semantic alignment representation and uses a lexical-constraint technique to promote generation quality.
PromptFuse (Liang et al., 2022) is a prompt-based learning method for infusing visual information into the language model. It randomly initializes two learnable vectors as the alignment prefix to improve the space-representation projection of image and text and bridge the semantic gap between the visual prefix and text.

Implementation Details
We use the Adam (Kingma and Ba, 2014) optimizer to train the above models on 2 A100 GPUs with a base learning rate of 2e-5, a batch size of 32, and a dropout rate of 0.1. For each sample, we set the maximum number of visual regions extracted by Faster-RCNN to 10. We set N_whole to 1 epoch and adopt the pretrained parameters of the base version of Oscar to initialize the multi-grained vision-language semantic alignmenter. When training the chunk-level semantic interactor on the Flickr30k Entities data set, we follow the parameter settings of the original work and train it for about ten epochs. We adopt RoBERTa-large to initialize the multi-modal context reasoner. The visual and cross-modal alignment prefix lengths are both set to 5. All methods on the two data sets employ the validation set to select the best-performing model.

[Table 2: Detailed results on the PMR test set (column values as in the original table): BERT-B (Devlin et al., 2019): 65.2, 19.8, 19.6, 4.5; Oscar-B: 76.1, 10.2, 12.1, 1.7; RoBERTa-L (Liu et al., 2019): 75.0, 17.7, 6.1, 1.2; PromptFuse (Liang et al., 2022): 76.5, 16.5, 5.8, 1.2; ERNIE-VL-L (Yu et al., 2021a): 79.9, 10.7, 8.2, 1.2; OFA-L: 79.1, 9.7, 9.9, 1.3; MVPTR (Li et al., 2022b): 78.9, 7.5, 11.8, 1.8; CALeC: 78.7, 8.6, 10.9, 1.8; ModCR (frozen VLMs): 84.3, 9.2, 5.6, 0.9; ModCR (fine-tune VLMs): 84.7, 7.8, 6.8, 0.7.]
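For reference, the training configuration described above can be collected in one place (the values are as reported; gathering them into a dict is our own organization, not the released code):

```python
# Hyperparameters reported in the implementation details.
train_config = {
    "optimizer": "Adam",
    "gpus": 2,                     # A100
    "learning_rate": 2e-5,
    "batch_size": 32,
    "dropout": 0.1,
    "max_visual_regions": 10,      # Faster-RCNN regions per image
    "n_whole_epochs": 1,           # warm-up before whole-model training
    "visual_prefix_len": 5,
    "alignment_prefix_len": 5,
    "alignmenter_init": "Oscar-base",
    "reasoner_init": "RoBERTa-large",
}
```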

Main Results
Overall Performance. We report the performance of models on the PMR and VCR (QR→A) data sets in Tables 1 and 3. From the overall experimental results, we observe that the proposed method significantly outperforms previously strong baselines, e.g., gains of 5.7% and 4.8% on the PMR validation and test sets compared to CALeC and ERNIE-VL-L. According to the performance of BERT-B and RoBERTa (text-only input), the premise provides vital information for inferring the correct option. Performance is further improved when visual content and the cross-modal semantic alignment prefix are combined for inference, e.g., ModCR (frozen VLMs) vs. RoBERTa: 84.3 vs. 75.0, and PromptFuse vs. RoBERTa: 76.5 vs. 75.0. On VCR (QR→A), however, we observe that the pretrained VLMs perform worse than RoBERTa-L, which shows that VLMs do not make good use of the abstract semantics of the premise for contextual reasoning. ModCR, which takes RoBERTa-L as its main backbone, surpasses pretrained VLMs and LMs on both data sets, suggesting that our method effectively utilizes the semantic information of different modalities while performing reasoning.

[Table 4: Experimental results of ModCR with different prefix lengths on the PMR data set. The parameters of VLMs are frozen for all ModCR variants. "LV" and "LA" indicate the lengths of the visual and alignment prefixes respectively, where "=0" means the corresponding mapping network is removed.]

Is Context Reasoning Capability Improved? We present the detailed performance of models on the PMR test set to check their ability to infer the different types of answer candidates, namely AT, D1, AF, and D2, as shown in Table 2. We find that the language model better uses the abstract semantic information of the premise to infer the correctness of the following action compared to VLMs; e.g., RoBERTa without visual information has the lowest error rate across all baselines in action recognition (AT). In addition, we find that although the ability of recently proposed VLMs to reason with abstract textual clues has improved, there is still a clear gap compared to LMs, e.g., AT performance: OFA-L (8.2) vs. RoBERTa (6.0). When employing the language model RoBERTa as the reasoner and infusing visual information into it, the overall accuracy of the model is further improved. However, the previous vision-infusing method has a low utilization rate of visual information (D1: 16.5 for PromptFuse). As the bottom two lines of Table 2 show, ModCR, which utilizes the multi-view text-image semantic alignment information, maintains the abstract reasoning ability based on the premise and also substantially improves the utilization rate of image information.

[Table 5: Detailed performance of ModCR with different training strategies. "MappNet" indicates the two types of mapping networks. "✓" means the parameters of the module are updated during training. The top three lines show results on VCR (QR→A) and the bottom three lines on PMR.]

From the above analysis, we conclude that introducing vision-language semantic alignment information is necessary for vision-aided language models, and that there is still large room for improvement in the contextual reasoning capability of pretrained VLMs.

Ablation Studies
To analyze the effectiveness of ModCR in detail, we design multiple model variants; the experimental results are shown in Tables 4 and 5. We select the high-quality PMR data set (manually annotated and inspected) as the setting for the ablation studies. For PromptFuse (Liang et al., 2022), we adopt RoBERTa-L as the backbone and update all parameters during training.

Is the Alignment Mapping Network Effective? From Table 4, comparing ModCR with LA = 0 and LA ≥ 1, we observe that the performance of ModCR drops markedly when it abandons vision-language semantic alignment information. Compared to PromptFuse, which randomly initializes two learnable alignment prefix vectors, the proposed alignment mapping network equipped with the multi-grained cross-modal alignmenter is more effective, e.g., PromptFuse vs. RoBERTa-L: 76.5 vs. 75.0, and the larger margin of ModCR over RoBERTa-L.

Effect of Prefix Length on Model Performance.
From the performance of the visual and alignment prefixes at different lengths in Table 4, we can see that the performance of ModCR varies greatly with the lengths of the two types of prefixes. ModCR performs best when both prefix lengths are set to 5. Furthermore, excessively long visual prefixes impair overall performance, which may be attributed to the fact that redundant or inaccurate visual prefixes have an adverse effect on the context learning capability of the language model.

Model Performance with Different Training Strategies. We present the detailed performance of ModCR with different training strategies in Table 5. Comparing the results of "frozen VLMs" and "fine-tune VLMs" on the two data sets, we observe that the performance of the proposed method is further improved when all parameters of ModCR are updated during training. Although training is slower, this can further integrate the complementary reasoning capabilities of the VLM and the LM. In addition, fine-tuning only MappNet yields inferior performance, which might be addressed by pretraining on an external large-scale image-text corpus.

Case Study
We report two cases in Figure 3 to analyse the performance of the models in detail. The premise texts of the two samples concern the character (top case) and the relationship (bottom case) of the persons, respectively. Although pretrained VLMs can infer whether an answer candidate matches the image content, they cannot effectively use the premise information to perform reasoning. In contrast, ModCR utilizes the semantic information of both modalities to determine the correct answer. This indicates that regarding the two different clues as pre-context states and employing the context reasoning ability of language models is a simple and effective approach for cross-modal reasoning tasks. In addition, ModCR can infer that the descriptions "in white shirt" and "lying on the bed" do not match the image content (the boy is wearing a blue shirt and sitting on a chair), which may be attributed to the semantic alignmenter. In conclusion, the alignment prefix improves overall performance by allowing the language model to understand visual information and perform reasoning.

Conclusion and Future Work
In this paper, we propose a multi-modal context reasoning approach named ModCR for the scenario of conditional inference on joint visual and textual clues. It regards the given image and text as two types of pre-context states and infuses them into the language model via instruction learning to perform such multi-modal reasoning. Experimental results on two data sets show the effectiveness of ModCR. In the future, we will explore two research directions: 1) improving the context learning capability of pretrained VLMs, and 2) exploring conditional inference on complex visual and textual clues involving multiple clues across more modalities.

Limitations
The proposed method has several limitations: 1) The current approach achieves strong context reasoning performance in the cross-modal setting of a single textual clue and a single image, but context reasoning in settings containing multiple textual and visual clues, such as video and long text, still needs further exploration. 2) From the experimental results, we observed that the visual prefix length greatly impacts the stability of language models infused with visual information. Hence, effective and stable vision-aided language models for natural language processing and multi-modal scenarios still need to be explored. 3) We also hope this work can spark further research on improving the long-context reasoning capability of pretrained vision-language models.