A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text

Pretrained Vision-Language Models (VLMs) have achieved remarkable performance in image retrieval from text. However, their performance drops drastically when confronted with linguistically complex texts that they struggle to comprehend. Inspired by the Divide-and-Conquer algorithm and dual-process theory, in this paper, we regard linguistically complex texts as compound proposition texts composed of multiple simple proposition sentences and propose an end-to-end Neural Divide-and-Conquer Reasoning framework, dubbed NDCR. It contains three main components: 1) Divide: a proposition generator divides the compound proposition text into simple proposition sentences and produces their corresponding representations, 2) Conquer: a pretrained VLMs-based visual-linguistic interactor achieves the interaction between decomposed proposition sentences and images, 3) Combine: a neural-symbolic reasoner combines the above reasoning states to obtain the final solution via a neural logic reasoning approach. According to the dual-process theory, the visual-linguistic interactor and neural-symbolic reasoner could be regarded as analogical reasoning System 1 and logical reasoning System 2. We conduct extensive experiments on a challenging image retrieval from contextual descriptions data set. Experimental results and analyses indicate NDCR significantly improves performance in the complex image-text reasoning problem.


Introduction
Image-text retrieval tasks have made remarkable progress owing to pretrained Vision-Language Models (VLMs) such as LXMERT (Tan and Bansal, 2019), UNITER, OSCAR (Li et al., 2020b), ViLBERT (Lu et al., 2019), CLIP (Radford et al., 2021), and many others. These VLMs are usually trained on large-scale corpora of short text-image pairs with cross-modal semantic alignment methods. They possess essential perceptual computing capability and excel at retrieving images from sentences with few objects and simple linguistic structure, e.g., "There is a duck swimming in the pond". However, when pretrained VLMs must retrieve the correct image from similar candidates based on a linguistically complex text, as in the example shown in Figure 1, previous works (Krojer et al., 2022; Talmor et al., 2021a; Thrush et al., 2022) show that they struggle to understand the elaborate description and perform complex cross-modal reasoning.
Figure 1: An example from the IMAGECODE (Krojer et al., 2022) data set, where the description is linguistically complex and images are minimally contrastive. The target image is in red and others are incorrect frames. The bottom part depicts the conventional method and the neural divide-and-conquer reasoning framework.
According to the dual-process theory of human thinking (Groves and Thompson, 1970; Evans, 2003; Pelaccia et al., 2011), human brains contain two thinking systems: System 1 performs analogical reasoning well, which is fast yet unconscious; System 2 is capable of abstract logical reasoning, which is slow yet conscious and well suited to complex reasoning problems. The theory could also hold for image-text retrieval tasks: the widely adopted models (e.g., VLMs) focus on analogical reasoning as System 1, according to analyses of deep learning networks (Bengio, 2017, 2019; Bengio et al., 2021). For linguistically complex descriptions that contain multiple conditions, these models perform poorly, and a logical reasoning System 2 is needed on top of System 1 to cover and logically combine the information scattered across the description. Inspired by the above investigations and the classical Divide-and-Conquer algorithm (Smith, 1985), we design an end-to-end Neural Divide-and-Conquer Reasoning framework named NDCR. As shown in Figure 1, our key idea is to regard the complex description as compound proposition text and solve the challenging retrieval problem in three steps: divide, conquer, and combine.
Specifically, Divide: NDCR first utilizes a proposition generator to divide the complex compound text and produce the global representations of simple proposition sentences, which can also be printed as readable text. Conquer: we devise a visual-linguistic interactor to achieve the interaction between decomposed proposition sentences and images, which resembles System 1. It uses a Transformer (Vaswani et al., 2017)-based contextual interactor to let different proposition-image pairs learn from each other. Considering the incorrectness or information loss of the simple proposition representations, we also present a modifier that incorporates the context reasoning information to improve their cross-modal reasoning states. Combine: we design a learnable neural-symbolic reasoner to integrate the reasoning information of simple propositions logically. It first employs a negation executor to obtain the negational reasoning hidden state of each simple proposition sentence and the corresponding confidence score. Then, we use the global reasoning information of the compound proposition text as the query signal to perform the conjunction operation across simple propositions and their negational information. Finally, as shown in Figure 1, we combine the inferred results of the neural-symbolic reasoner (resembling System 2) and the visual-linguistic interactor (resembling System 1) to obtain the final solution. In this way, the whole framework integrates the capabilities of Systems 1 and 2 to obtain better performance.
We conduct extensive experiments on a large-scale data set for image retrieval from contextual descriptions, IMAGECODE (Krojer et al., 2022). The experimental results indicate that NDCR achieves state-of-the-art performance, and the ablation and case studies verify the effectiveness of the different modules.
Our contributions are as follows:
• We propose a divide-and-conquer reasoning framework for image retrieval from linguistically complex text, making the first attempt to combine the perceptually analogical reasoning System 1 and the neural-symbolic logic reasoning System 2 to solve the complex multimodal reasoning problem.
• We design a proposition generator capable of producing the global representations of decomposed simple proposition sentences for linguistically complex texts and printing them as readable text.
• Experimental results indicate that our approach remarkably improves performance, and we obtain first place on the leaderboard. Ablation and case studies confirm the effectiveness of introducing logical reasoning System 2 and combining it with System 1.

Related Works
Pretrained Vision-Language Models for Cross Modal Matching. Owing to the success of the Transformer (Vaswani et al., 2017) architecture equipped with the pretrain-finetune (Erhan et al., 2010) learning method, pretrained VLMs have achieved remarkable performance in cross-modal matching and reasoning tasks (Talmor et al., 2021b), especially image-text retrieval. Early pretrained VLMs such as ViLBERT (Lu et al., 2019), VisualBERT (Li et al., 2019), and Oscar (Li et al., 2020b) utilize a BERT (Devlin et al., 2019)-like single-encoder architecture to encode and fuse the image-text information and then perform image-text reasoning. In addition, dual-encoder architectures such as CLIP (Radford et al., 2021) and ALBERT perform better than single-encoder architectures on image-text matching tasks and are widely used in industry because of their efficiency.
Divide-and-Conquer for Question Answering. The divide-and-conquer algorithm (Smith, 1985) divides a complex problem into multiple simple problems and then combines the subproblem results to achieve the final solution. This idea has been used in complex question-answering tasks in natural language processing. Zhang et al. (2019) proposed to utilize the decomposition of complex questions for semantic parsing. Min et al. (2019) adopted a question decomposition and rescoring method to perform multi-hop reading comprehension, which makes the reasoning path interpretable and robust. Wolfson et al. (2022) utilized the QDMR structures of complex questions to conduct a decompose-synthesize text-to-SQL transformation. Such pipeline approaches may lead to error cascades in the subsequent inference process when the decomposed text is incomplete or wrong. The image-text retrieval task has strict requirements on the correctness of text semantic understanding; thus we propose an end-to-end divide-and-conquer method that alleviates the error cascade issue by learning the whole process jointly.
Dual-Process Theory. The dual-process theory holds that human brains have two different thinking systems: System 1 performs analogical reasoning, and System 2 performs conscious logical reasoning. Several researchers have applied this theory to practical tasks. Mittal et al. (2017) argued that combining vector space models with external knowledge graphs could be regarded as thinking 'fast' in vector space along with thinking 'slow' and 'deeply' by reasoning over the knowledge graph. Anthony et al. (2017) proposed to use a deep learning network and a tree search engine as System 1 and System 2, respectively, for sequential decision-making problems. Bengio (2017, 2019) advocated the design of a conscious network to achieve the leap from System 1 to System 2. Liu et al. (2022) designed a neural-symbolic system for natural language understanding tasks, which combines an explicit symbolic calculation-based System 2 with a fast deep learning network-based System 1. For complex multi-modal reasoning problems, e.g., image retrieval from linguistically complex text, humans usually combine System 1 and System 2 to obtain the final solution. However, current methods rely mainly on deep learning networks, which resemble System 1 and lack logical reasoning capability, and thus struggle with image-text reasoning over complex descriptions. In this light, we make the first attempt to combine System 1 and System 2 to tackle this issue by designing a neural divide-and-conquer reasoning framework. We introduce a neural-symbolic reasoner in System 2 to conduct the logical operations. The overall framework contains both analogical and logical reasoning, as in human thinking, and yields appreciable gains.

Overview
Image retrieval from contextual descriptions (Krojer et al., 2022) aims to infer the correct image given a linguistically complex text $Y = (y_1, \ldots, y_N)$ and similar images $I = (I_1, \ldots, I_L)$, where $y_i$, $N$, $I_i$, and $L$ represent the $i$-th token, the total length of the text, the $i$-th image, and the number of images, respectively. We propose a novel divide-and-conquer reasoning framework to tackle this task. It consists of three components, namely, the Proposition Generator, the Visual-Linguistic Interactor, and the Neural-Symbolic Reasoner, which are coupled and trained in an end-to-end manner. Specifically, the proposition generator divides the complex description into multiple proposition sentences, converting the complex matching problem into simple ones. Afterwards, the visual-linguistic interactor achieves the interaction between decomposed proposition sentences and images, resembling System 1, to perform the essential analogical reasoning. Subsequently, the neural-symbolic reasoner, which relies on the reasoning states output by the visual-linguistic interactor, resembles System 2 and performs logical reasoning. Finally, we combine the outputs of System 1 and System 2 to obtain the final solution.

Proposition Generator
The proposition generator is a sequence-to-sequence model based on the pretrained language model BART. As shown in Figure 2, it employs the encoder to obtain the text representation $H_Y = (h_{cls}, h_{y_1}, \ldots, h_{y_N})$, where $h_{y_i}$ represents the hidden state of the $i$-th token. Subsequently, we design a two-layer semantic parsing module to obtain the global representations of simple proposition sentences. Concretely, we set the maximum number of simple propositions to 10 and randomly initialize their representations. The initial vectors are fed to the semantic parsing module to interact with the compound text representation. Taking the first layer as an example, the calculation process is given in Eq. 1, where $h_I$ denotes the randomly initialized proposition representations. The attention and feed-forward network (FNN) subnetworks are identical to those of the transformer (Vaswani et al., 2017) architecture. Different from the transformer, we subtract the output of the Self-Attention layer from the output of the Cross-Attention layer, aiming to capture the information differences across propositions. After the same calculation in both layers, we obtain ten global hidden states of simple propositions. Because different contexts contain different numbers of simple propositions, we use an MLP to predict the target number of simple proposition sentences. It attends only to the global hidden state $h_{cls}$ of the compound proposition text. Suppose that the predicted number $M$ of simple propositions is 3 (as in Figure 2); we then adopt the first three hidden states of the semantic parsing module as the global representations of the targeted simple propositions. As shown in Figure 2, to explain what the simple propositions represent, we also use the decoder of BART to generate each simple proposition sentence by attending only to its global representation.
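The parsing layer described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: attention is single-head with no learned projections, the feed-forward sublayer is omitted, the cross-attention queries are assumed to be the self-attention outputs, and `parsing_layer`, `attend`, and `predicted_count` are hypothetical names. The value 3 stands in for the MLP prediction made from $h_{cls}$.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        outputs.append([sum(w * m[j] for w, m in zip(weights, memory)) for j in range(dim)])
    return outputs


def parsing_layer(slots, text_states):
    """One layer: self-attention, cross-attention, then their difference."""
    self_out = attend(slots, slots)
    cross_out = attend(self_out, text_states)  # assumption: queries are self_out
    # Subtract the self-attention output from the cross-attention output,
    # as the text describes; the feed-forward sublayer is omitted here.
    return [[c - s for c, s in zip(c_row, s_row)]
            for c_row, s_row in zip(cross_out, self_out)]


random.seed(0)
DIM, MAX_PROPS, NUM_TOKENS = 8, 10, 6  # toy sizes; only MAX_PROPS = 10 is from the text
slots = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(MAX_PROPS)]
text_states = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

for _ in range(2):  # two-layer semantic parsing module
    slots = parsing_layer(slots, text_states)

predicted_count = 3  # stand-in for the MLP prediction from h_cls
proposition_states = slots[:predicted_count]  # keep the first M hidden states
```

Keeping a fixed pool of ten slots and truncating to the predicted count means the representation size never depends on the input, which is one plausible reason for the design.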

System 1: Visual-Linguistic Interactor
After obtaining the global representations of simple proposition sentences, we introduce the visual-linguistic interactor to mine the interaction of image-proposition pairs. Specifically, we use a pretrained visual encoder to obtain the image encoding representations $H_I = (h_{I_1}, \ldots, h_{I_L})$ and fuse them with the simple proposition representations via a dot product (the "F" shown in Figure 2). In this two-modal fusion process, $\lambda$ is a hyperparameter set to enlarge the scale of the fused vectors. We denote the fused sequence representation of proposition-image pairs as $H(p) = (H(p_1), \ldots, H(p_M))$, where $H(p_1)$ indicates the sequential representation of the first proposition combined with the images.
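The fusion step can be illustrated as follows. This is a sketch under an assumption: the "dot-product" fusion is read here as an element-wise product between a proposition vector and each image vector, scaled by $\lambda$, so that every proposition yields a sequence of $L$ fused vectors. The function names `fuse` and `fuse_all` are hypothetical, and the toy vectors are illustrative only.

```python
def fuse(proposition, image_states, scale=1000.0):
    """Fuse one proposition vector with every image vector.

    Assumption: fusion is an element-wise product scaled by `scale`
    (lambda in the text), giving one fused vector per image.
    """
    return [[scale * p * x for p, x in zip(proposition, image)]
            for image in image_states]


def fuse_all(propositions, image_states, scale=1000.0):
    """Build H(p) = (H(p_1), ..., H(p_M)), one fused sequence per proposition."""
    return [fuse(p, image_states, scale) for p in propositions]


# Toy example: M = 2 propositions, L = 3 images, d = 2.
propositions = [[0.5, 2.0], [1.0, -1.0]]
image_states = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]
fused = fuse_all(propositions, image_states)
```

The result has shape $M \times L \times d$, which matches the description of $H(p)$ as a sequence of per-proposition sequences over images.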
Then, we employ a two-layer transformer to perform contextual information interaction over the fused sequential representations $H(p)$ and obtain the initial reasoning states of the simple propositions on the images. Considering the incorrectness or information loss of the simple proposition representations obtained by the proposition generator, we introduce an MLP-based modifier that incorporates the reasoning state of the compound proposition text to enhance these initial reasoning states. The whole process is performed as in Eq. 2, where $H_C$ indicates the fusion information of the compound proposition text and images, obtained by the cross-modal encoder (abbreviated as cross encoder in Figure 2). $W_{M_1} \in \mathbb{R}^{2d \times d}$ and $W_{M_2} \in \mathbb{R}^{2d \times 2d}$ are learnable parameters. Before feeding $H(p)$ into the transformer, we add learnable position embeddings $PE$ to help it attend to the contextual information across images. After obtaining the final reasoning states $H_{S_1} = (h^{+}_{1}, \ldots, h^{+}_{M})$ of the simple propositions in System 1, we adopt a linear prediction head to produce the confidence score of each proposition over the images, defined as $P_{S_1} = (p^{+}_{1}, \ldots, p^{+}_{M})$ with each $p^{+}_{i} \in \mathbb{R}^{1 \times L}$.

System 2: Neural-Symbolic Reasoner
For complex reasoning problems, the logical reasoning process usually plays a more significant role in both intelligent machines and human reasoning (Bengio, 2019), and the visual-linguistic interactor is not capable of it. Instead of combining the inference results of System 1 via rule-based methods such as mean pooling, and inspired by Shi et al. (2020), we devise a learnable Neural-Symbolic Reasoner (NSR) to perform logical reasoning on top of System 1, as shown in Figure 2. As depicted in Figure 3, it contains a negation executor to obtain the negational reasoning states and a conjunction operation to acquire the result of logical reasoning by attending to the positive and negational reasoning information.
Negation Executor. The negation executor is a module that takes the reasoning state of a simple proposition as input and produces the corresponding reasoning state of its negation as output. Its aim is to obtain useful cross-modal reasoning states for the negation of a proposition. We regard $H_{S_1}$ as the positive reasoning state and use a two-layer MLP with the ReLU activation function to obtain the negational reasoning state. The calculation process is given in Eq. 3, where $W_{n_2}, W_{n_1} \in \mathbb{R}^{d \times d}$ and $b_{n_1}, b_{n_2} \in \mathbb{R}^{1 \times d}$ are learnable parameters. We define the output of the negation executor as $H_N = (h^{-}_{1}, \ldots, h^{-}_{M})$, in contrast to $H_{S_1}$. The negational proposition thus has a cross-modal reasoning state $H_N$ that differs from that of the corresponding positive proposition, $H_{S_1}$. We use the same linear prediction head as in System 1 to produce the corresponding confidence scores on the images, denoted $P_N = (p^{-}_{1}, \ldots, p^{-}_{M})$. To make the negation executor effective, we define a negational feedback loss to optimize it locally.
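A two-layer MLP with ReLU, as described for the negation executor, can be sketched as below. The composition order (first affine map, ReLU, second affine map) and the row-vector convention are assumptions based on the stated parameter shapes; `negation_executor` and `linear` are hypothetical names, and the hand-picked weights are for illustration rather than learned values.

```python
def linear(x, weight, bias):
    """Row-vector affine map: x @ weight + bias."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


def negation_executor(h_pos, w1, b1, w2, b2):
    """Map a positive reasoning state h+ to a negational state h-.

    Assumed form: h- = ReLU(h+ W1 + b1) W2 + b2, with W1, W2 of shape d x d.
    """
    return linear(relu(linear(h_pos, w1, b1)), w2, b2)


# Toy weights (d = 2): identity first layer, sign-flipping second layer.
identity = [[1.0, 0.0], [0.0, 1.0]]
flip = [[-1.0, 0.0], [0.0, -1.0]]
zeros = [0.0, 0.0]
h_neg = negation_executor([1.0, -2.0], identity, zeros, flip, zeros)
```

In the full model the resulting state would be passed through the same prediction head as the positive state to obtain $p^{-}_{i}$; that head is left out of the sketch.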
Conjunction Operation. First, we define a new joint representation that incorporates the reasoning hidden states and the corresponding confidence scores as the initial state of the conjunction operation. The process is presented in Eq. 4, where $[\cdot, \cdot]$ indicates concatenation and $H_I$ is the representation of the images. $H^{ns}_{p^{+}_{i}}$ represents the positive joint representation of the $i$-th proposition. We use the same calculation as in Eq. 4 to obtain the initial negational representation $H^{ns}_{p^{-}_{i}}$. Then, we utilize the reasoning state of the compound proposition text $H^{sg}_{C}$ (Eq. 2) as the signal that drives the conjunction calculation via multi-head attention equipped with gate fusion, as shown in Figure 3. The whole calculation process is presented in Eq. 5, where $W_s \in \mathbb{R}^{2d \times 2d}$, $W_g \in \mathbb{R}^{1 \times 4d}$, and $W_{S_2} \in \mathbb{R}^{2d \times d}$ are learnable parameters and $H_f \in \mathbb{R}^{1 \times L \times d}$. We also utilize another linear prediction head to obtain the final confidence score of the neural-symbolic reasoner, defined as $P_{S_2} \in \mathbb{R}^{1 \times L}$.
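The idea of query-driven attention followed by gate fusion can be sketched for a single image as follows. This is a heavily simplified illustration, not Eq. 5 itself: it uses single-head attention without learned projections, a scalar gate computed from the concatenated query and attended vector, and plain vectors in place of the joint representations. The names `conjunction` and `gate_weights` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def conjunction(query, positive, negative, gate_weights):
    """Attend from the compound-text state over positive and negational states.

    Assumptions: single head, no projections, scalar sigmoid gate over
    the concatenation [query, attended].
    """
    candidates = positive + negative  # 2M candidate states
    dim = len(query)
    weights = softmax([dot(query, c) / math.sqrt(dim) for c in candidates])
    attended = [sum(w * c[j] for w, c in zip(weights, candidates)) for j in range(dim)]
    gate = 1.0 / (1.0 + math.exp(-dot(gate_weights, query + attended)))
    fused = [gate * a + (1.0 - gate) * q for a, q in zip(attended, query)]
    return fused, weights


# Toy example with M = 2 propositions and d = 2.
query = [1.0, 0.0]
positive = [[1.0, 0.0], [0.0, 1.0]]
negative = [[-1.0, 0.0], [0.0, -1.0]]
fused, weights = conjunction(query, positive, negative, gate_weights=[0.0] * 4)
```

With zero gate weights the gate equals 0.5, so the fused state is the midpoint between the query and the attended vector; a learned gate would shift that balance per example.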

Combining System 1 and System 2
In addition, we combine the inferred confidence scores of System 1 and System 2 to obtain the final solution, so that the two systems complement each other. First, we acquire the whole representations of $H_{S_1}$ and $H_f$ using the learnable parameters $W_l \in \mathbb{R}^{d \times 1}$ and $b_l$. The final confidence score is then computed with further learnable parameters and the sigmoid activation function $f(\cdot)$. In this way, we obtain the final result by selecting the image with the maximum confidence score in $P_f \in \mathbb{R}^{1 \times L}$.
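A gated combination of the two systems' scores, followed by picking the top-scoring image, might look like the sketch below. The exact combination formula is not recoverable from the text here, so this assumes a convex mix controlled by a sigmoid gate, and it mean-pools the per-proposition System 1 scores as a stand-in aggregation. The names `combine`, `retrieve`, and `gate_logit` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, f(.) in the text."""
    return 1.0 / (1.0 + math.exp(-x))


def combine(system1_scores, system2_scores, gate_logit=0.0):
    """Mix System 1 and System 2 confidence scores over L images.

    Assumptions: System 1 scores (M x L) are mean-pooled over propositions,
    and the mix is gate * S2 + (1 - gate) * S1 with a scalar gate.
    """
    num_props = len(system1_scores)
    pooled = [sum(column) / num_props for column in zip(*system1_scores)]
    gate = sigmoid(gate_logit)
    return [gate * s2 + (1.0 - gate) * s1 for s1, s2 in zip(pooled, system2_scores)]


def retrieve(final_scores):
    """Return the index of the image with the maximum confidence score."""
    return max(range(len(final_scores)), key=final_scores.__getitem__)


# Toy example: M = 2 propositions, L = 3 candidate images.
p_s1 = [[0.1, 0.7, 0.2], [0.2, 0.5, 0.3]]
p_s2 = [0.1, 0.2, 0.7]
p_f = combine(p_s1, p_s2)
best = retrieve(p_f)
```

In this toy case System 1 alone would favor the second image, while the combined score favors the third, which shows how the second system can overturn a pooled perceptual decision.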

Training Strategies
To make the proposition generator perform proposition decomposition and generation effectively, we first train it alone on a large-scale corpus and then train the whole NDCR framework on the task-specific training data. The two training phases are as follows.
Phase 1. We first pretrain the proposition generator on the released large-scale complex text simplification data set MinWikiSplit (Niklaus et al., 2019), which is composed of 203K pairs of aligned complex source and simplified target sentences. We adopt the cross-entropy generation loss $L_g$ for the decoder output. Similar to SimCSE, we employ the contrastive learning loss $L_c$ to make the global representations of different simple proposition sentences distinct. In addition, we use a cross-entropy multi-label classification loss $L_p$ to train the prediction head for the number of propositions, where the label is the number of simple sentences in the pretraining corpus. The whole training loss combines $L_g$, $L_c$, and $L_p$.
Phase 2. While training NDCR, we employ the proposition sentence-image confidence scores to calculate the classification loss $L_{match}$. The loss covers the outputs of System 1, System 2, and the final solution, where $p_i \in \mathbb{R}^{1 \times L}$ is a confidence score and $q$ is the golden label. To make the negation executor effective, we devise a negational feedback loss $L_{neg}$ to optimize it. We take the prediction of the modifier in System 1 as the positive distribution and push the belief distribution output by the negation executor over the image candidates away from this positive distribution. The loss calculation is shown in Eq. 10, where KL indicates the K-L divergence (Kullback and Leibler, 1951) and $\theta$ is a hyperparameter used to expand the interval between the positive and negational distributions, which is set to 0.2. Hence, the whole optimization target is $L_{match} + L_{neg}$.
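The Phase 2 objective can be sketched as follows. Two assumptions are made, since the equations are not reproduced here: the matching loss is taken as a sum of cross-entropy terms over all outputs, and the negational feedback loss is taken as a hinge on the KL divergence with margin $\theta = 0.2$, so that it vanishes once the negational distribution is far enough from the positive one. All function names are hypothetical, and the inputs are assumed to be strictly positive probability distributions over the images.

```python
import math


def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the golden image."""
    return -math.log(probs[gold_index])


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def matching_loss(all_outputs, gold_index):
    """Sum of classification losses over System 1, System 2, and final outputs."""
    return sum(cross_entropy(p, gold_index) for p in all_outputs)


def negational_feedback_loss(positive, negative, margin=0.2):
    """Assumed hinge form: penalize negational beliefs too close to positive ones."""
    return max(0.0, margin - kl_divergence(positive, negative))


# Toy example with L = 2 candidate images and golden label 0.
outputs = [[0.5, 0.5], [0.25, 0.75]]
l_match = matching_loss(outputs, gold_index=0)
l_neg_same = negational_feedback_loss([0.5, 0.5], [0.5, 0.5])
l_neg_far = negational_feedback_loss([0.9, 0.1], [0.1, 0.9])
total = l_match + l_neg_far
```

Under this hinge reading, identical positive and negational distributions incur the full margin as a penalty, while well-separated ones contribute nothing.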

Dataset
We conduct extensive experiments on the challenging data set IMAGECODE (Krojer et al., 2022), which contains 94,020 images divided into 9,402 sets. The images are collected from four released data sets: MSR-VTT (Xu et al., 2016), Video-Storytelling (Li et al., 2020a), YouCook (Das et al., 2013), and Open Images V6 (Kuznetsova et al., 2020). The data set consists of 21,202 human-written complex descriptions with manually labelled golden images, which are divided into 16,594, 2,302, and 2,306 examples for training, validation, and testing, respectively. The image sources in the overall data set include video frames and static images.
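The reported statistics are internally consistent, which a short check makes explicit: the image count divides evenly into sets of ten (matching $L = 10$ candidates per example), and the three splits sum to the total number of descriptions. The variable names below are illustrative.

```python
# Figures as reported for IMAGECODE in the text.
total_images = 94_020
image_sets = 9_402
total_descriptions = 21_202
train, validation, test = 16_594, 2_302, 2_306

images_per_set = total_images // image_sets  # candidates per retrieval example
leftover_images = total_images % image_sets  # should be zero if sets are uniform
split_total = train + validation + test      # should equal total_descriptions
train_fraction = train / total_descriptions  # share of descriptions used for training
```

Since there are more descriptions than image sets, some sets necessarily carry more than one description.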

Baselines
We compare NDCR with various types of pretrained VLMs and with other models designed for the specific conditions of this task. Specifically, ViLBERT (Lu et al., 2019) is a cross encoder where language and vision interact in the transitional layer via cross-attention. CLIP (Radford et al., 2021) is a two-stream vision-language encoder with two independent visual and textual encoders. UNITER is a single-stream encoder where visual representations and text tokens are concatenated and interact via the same transformer. OFA (Wang et al., 2022) is a unified cross-modal and unimodal encoder that has achieved impressive performance on multiple cross-modal reasoning tasks. Krojer et al. (2022) also designed a contextual module to improve the interaction across different text-image fusion representations, achieving state-of-the-art performance.

Implementation Details
We set $L$, $\lambda$, and $d$ to 10, 1000, and 512, respectively. For the proposition generator, we adopt a two-layer semantic parsing module and the pretrained parameters of the BART-base version. We set the maximum number of propositions to 10 and train the proposition generator for 15 epochs on the MinWikiSplit data set. In addition, we set the depth of the transformer block to 2 in the visual-linguistic interactor and utilize the finetuned visual encoder of CLIP (ViT-B/16) to encode images. For the cross encoder, we adopt the OFA-large architecture and first finetune it for two epochs before training the overall structure of NDCR. We freeze the cross encoder, proposition generator, and visual encoder to prevent overfitting while training NDCR. While training all models, we set the batch size, initial learning rate, and dropout rate to 36, $6 \times 10^{-5}$, and 0.1, respectively. The maximum number of training epochs is set to 30, and we employ the Adam optimizer (Kingma and Ba, 2014) with a linearly declining learning rate to train all models. We use the validation set to select the best-performing model.
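A linearly declining learning rate with the reported initial value and epoch budget can be sketched as below. The text does not say whether the rate is updated per step or per epoch, whether warmup is used, or what the final value is, so this assumes per-epoch updates, no warmup, and decay to zero at the last epoch. The function name `linear_decay_lr` is hypothetical.

```python
def linear_decay_lr(epoch, total_epochs=30, initial_lr=6e-5):
    """Learning rate after `epoch` epochs under linear decay to zero.

    Assumptions: per-epoch updates, no warmup, final rate of zero.
    """
    remaining = max(0.0, 1.0 - epoch / total_epochs)
    return initial_lr * remaining


# Rate at the start of each epoch, plus the value reached at the very end.
schedule = [linear_decay_lr(epoch) for epoch in range(31)]
```

Under these assumptions the rate halves by epoch 15 and reaches zero exactly when the 30-epoch budget is exhausted.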

Main Results
Overall Performance. We present the performance of NDCR and comparative models on the test set in Table 1. '†' indicates that the pretrained VLMs are equipped with the contextual module and temporal embedding to enhance the contextual semantic interaction across similar images. This variant is effective on video frames, as shown by comparisons such as CLIP vs. CLIP†. Table 1 shows that the proposed method achieves new state-of-the-art performance on the whole test set and significantly outperforms the previous strong baseline (34.1 vs. 29.9, ↑ 4.2). NDCR improves performance on both video frames and static images, especially static images (↑ 4.3), which shows that it generalizes across cases. We observe that all models perform poorly on the test samples whose images come from video clips, which may be attributed to the high similarity across video frames. Hence, there is considerable room to improve overall performance on this challenging multi-modal reasoning task.
Table 1: Performance on the test set (rows: method; columns: image type).
Method                            All    Video   Static
CLIP (Radford et al., 2021)       28.4   20.0    60.0
CLIP† (Krojer et al., 2022)       29.9   22.0    59.8
UNITER                            24.8   17.4    52.8
UNITER† (Krojer et al., 2022)     25.7   19.1    50.5
ViLBERT (Lu et al., 2019)         20.9   15.0    42.7
ViLBERT† (Krojer et al., 2022)

Ablation Study
Effectiveness of Modules. To study the effectiveness of different modules, we re-annotate the test samples with the help of eight related workers (the original test labels are not released). The experimental results are presented in Table 2. The performances of the reproduced baselines and NDCR decline slightly because the labelling process is difficult for most examples. There are some quality differences across the human-labelled results, yet they do not affect testing and comparing model performances. For the fairness of model comparison, the random seeds of all ablation experiments are set to the same value, 10. First, NDCR achieves the best performance and significantly surpasses other models on the two test sets. When we add System 2 on top of System 1, the overall performance improves by about 1.0, suggesting the effectiveness of the neural-symbolic reasoner. Comparing System 2 and System 2 w/o negation, we observe that the negation executor improves the performance of the neural-symbolic reasoner, mainly in the case of static images. In addition, comparing System 1 and System 1 w/o modifier, we observe that introducing the context reasoning information is a very useful way to enhance the reasoning state representations of decomposed simple proposition sentences. Compared to the best baseline OFA-large (470M), the total parameter size of NDCR is about 440M. NDCR has fewer parameters yet significantly outperforms it (as shown in Table 2). This suggests that the overall performance improvement of NDCR is not due to having more parameters.
System 1 vs. System 2. We break down the experimental results on the test set according to the number of simple proposition sentences into which the compound proposition texts are divided. The results are shown in Table 3. They show that NDCR excels at image retrieval from complex text of medium length, especially text containing three simple proposition sentences. This verifies the proposed method's effectiveness in handling the complex image-text reasoning problem. Compared to System 1, System 2 performs better on test samples containing 2 or 3 simple proposition sentences, which suggests that the neural-symbolic reasoner improves the conjunction of the prediction results of decomposed propositions compared to rule-based methods such as mean pooling.

Case Study
We present two cases in Figures 4 and 5. In the first case (Figure 4), the proposition generator divides the complex text into three proposition sentences, and System 1 infers their confidence scores ($P^{+}_{1,2,3}$) over ten images. Although the results for the simple proposition sentences contain some errors because there is no explicit supervision signal for training them, System 2 (the neural-symbolic reasoner) obtains the correct result through its logical reasoning operation, unlike the rule-based aggregation method in System 1. This indicates the robustness of System 2. In addition, we observe that the pretrained VLMs and System 1, which are capable of perceptual computing, often fail to cover all text semantics. They easily ignore pivotal text information (such as "there is no text" shown in Figure 5), which leads to inference errors. In conclusion, combining the logical reasoning System 2 with a powerful analogical reasoning System 1 (e.g., pretrained VLMs) has significant potential to exploit the advantages of both in addressing complex reasoning problems.

Conclusion
In this paper, inspired by the divide-and-conquer algorithm and dual-process theory, we introduced an end-to-end neural divide-and-conquer reasoning framework named NDCR to handle the challenging case of image retrieval from linguistically complex text. NDCR contains a proposition generator that divides the compound proposition text into multiple simple proposition sentences, and then uses a visual-linguistic interactor to achieve the interaction of simple propositions and images. To improve the logical reasoning capability, we devise a neural-symbolic reasoner that derives the logical inference result from the output of the visual-linguistic interactor. In this way, NDCR performs low-level analogical perceptual computing in System 1 (visual-linguistic interactor) and high-level logical reasoning in System 2 (neural-symbolic reasoner). Finally, we combine the outputs of Systems 1 and 2 to obtain the final solution.

Limitations
The proposed method NDCR has some limitations, as follows:
1) The representations of simple proposition sentences produced by the proposition generator lie in a different distribution space from the image encodings, which affects the quality of their fused representation. Although we introduce the reasoning information of the compound proposition text to alleviate this issue, we hope to solve it by improving the text understanding capability of pretrained VLMs. In addition, adopting the pretrained textual encoder of VLMs to perform proposition decomposition is inadequate because such encoders show an inferior understanding of the discourse structure of long texts.
2) The performance on samples with highly similar images from video frames is still far from that of humans. We may improve it from the perspective of image difference modelling.
3) The experimental results indicate that our method is effective at logical inference on examples with medium-length descriptions, but there is still room for improvement for longer descriptions.

Ethics Statement
IMAGECODE (Krojer et al., 2022) is an open data set used for scientific research. For the ablation studies on the test set, we hired master's and undergraduate students from the research group to re-annotate the labels of the test set. We have informed the creators of the data set and only conducted scientific research.