PV2TEA: Patching Visual Modality to Textual-Established Information Extraction

Information extraction, e.g., attribute value extraction, has been extensively studied and formulated based only on text. However, many attributes can benefit from image-based extraction, such as color, shape, and pattern. The visual modality has long been underutilized, mainly due to the difficulty of multimodal annotation. In this paper, we aim to patch the visual modality onto a textual-established attribute information extractor. The cross-modality integration faces several unique challenges: (C1) images and textual descriptions are loosely paired, both intra-sample and inter-sample; (C2) images usually contain rich backgrounds that can mislead the prediction; (C3) weakly supervised labels from textual-established extractors are biased for multimodal training. We present PV2TEA, an encoder-decoder architecture equipped with three bias reduction schemes: (S1) augmented label-smoothed contrast to improve cross-modality alignment for loosely paired images and texts; (S2) attention pruning that adaptively distinguishes the visual foreground; (S3) two-level neighborhood regularization that mitigates textual label bias via reliability estimation. Empirical results on real-world e-Commerce datasets demonstrate up to an 11.74% absolute (20.97% relative) F1 increase over unimodal baselines.


Introduction
Information extraction, e.g., attribute value extraction, aims to extract structured knowledge triples, i.e., (sample_id, attribute, value), from unstructured information. As shown in Figure 1, the inputs include text descriptions and (optionally) images along with the queried attribute, and the output is the extracted value. In practice, the textual description has served as the main or only input in mainstream approaches for automatic attribute value extraction (Zheng et al., 2018; Xu et al., 2019; Wang et al., 2020; Karamanolakis et al., 2020; Yan et al., 2021; Ding et al., 2022). Such models perform well when the prediction targets are inferable from the text.
As datasets evolve, interest in incorporating the visual modality naturally arises, especially for image-driven attributes, e.g., Color, Pattern, and Item Shape. Such extraction tasks rely heavily on visual information to obtain the correct attribute values. The complementary information contained in images can improve recall in cases where the target values are not mentioned in the texts. In the meantime, cross-modality information can help with ambiguous cases and improve precision.
However, extending a single-modality task to multi-modality can be very challenging, especially due to the lack of annotations in the new modality. Performing accurate labeling based on multiple modalities requires the annotator to consult multiple information resources, leading to a high cost of human labor. Although there are some initial explorations of multimodal attribute value extraction (Zhu et al., 2020; Lin et al., 2021; De la Comble et al., 2022), all of them are fully supervised and overlook the resource-constrained setting of building a multimodal attribute extraction framework on top of previously textual-established models. In this paper, we aim to patch the visual modality to attribute value extraction by leveraging textual-based models for weak supervision, thus reducing the manual labeling effort.
Challenges. Several unique challenges exist in visual modality patching: C1. Images and their textual descriptions are usually loosely aligned in two aspects. From the intra-sample aspect, they are usually only weakly related considering the rich characteristics, making it difficult to ground the language fragments to the corresponding image regions. From the inter-sample aspect, it is commonly observed that the text description of one sample may also partially match the image of another. As illustrated in Figure 1, the textual description of the mattress product is fragmented and can also correspond to other images in the training data. Therefore, traditional training objectives for multimodal learning, such as binary matching (Kim et al., 2021) or contrastive loss (Radford et al., 2021), that only treat the text and image of the same sample as positive pairs may not be appropriate. C2. Bias can be brought by the visual input from the noisy contextual background. Images usually not only contain the object of interest itself but also demonstrate a complex background scene. Although backgrounds are helpful for scene understanding, they may also introduce spurious correlations in a fine-grained task such as attribute value extraction, which leads to imprecise prediction (Xiao et al., 2021; Kan et al., 2021). C3. Bias also exists from the language perspective in the form of biased weak labels from textual-based models. As illustrated in Figure 1, the color label of the mattress is misled by 'green tea infused' in the text. These noisy labels can be even more catastrophic for a multimodal model due to their incorrect grounding in images. Directly training the model with these biased labels can lead to gaps between the stronger language modality and the weaker vision modality (Yu et al., 2021).
Solutions. We propose PV2TEA, a sequence-to-sequence backbone composed of three modules: visual encoding, cross-modality fusion and grounding, and attribute value generation, each with a bias-reduction scheme dedicated to the above challenges. S1. To better integrate the loosely-aligned texts and images, we design an augmented label-smoothed contrast schema for cross-modality fusion and grounding, which considers both the intra-sample weak correlation and the inter-sample potential alignment, encouraging knowledge transfer from the strong textual modality to the weak visual one. S2. During visual encoding, we equip PV2TEA with an attention-pruning mechanism that adaptively distinguishes the distracting background and attends to the most relevant regions of the entire input image, aiming to improve precision in the fine-grained task of attribute extraction. S3. To mitigate the bias from textual-biased weak labels, a two-level neighborhood regularization based on visual features and previous predictions is designed to emphasize trustworthy training samples while mitigating the influence of textual-biased labels. In this way, the model learns to generate more balanced results rather than being dominated by one modality of information. In summary, the main contributions of PV2TEA are three-fold:

Problem Definition
We consider the task of automatic attribute extraction from multimodal input, i.e., textual descriptions and images. Formally, the input is a query attribute R and a text-image paired dataset D = {X_n = (I_n, T_n, c_n)}_{n=1}^{N} of N samples (e.g., products), where I_n represents the profile image of X_n, T_n represents the textual description, and c_n is the sample category (e.g., product type). The model is expected to infer the attribute value y_n of the query attribute R for sample X_n. We consider the challenging setting with open-vocabulary attributes, where the number of candidate values is extensive and y_n can contain either a single value or multiple values.
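As a minimal illustration of this input/output contract (all names are our own and not from the paper's code), a sample carries an image, a text description, and a category, and the extractor's outputs can be packaged into knowledge triples:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Sample:
    """One input sample X_n; field names are illustrative."""
    sample_id: str
    image_path: str   # profile image I_n
    text: str         # textual description T_n
    category: str     # sample category c_n, e.g., a product type

def to_triples(sample_id: str, attribute: str,
               values: List[str]) -> List[Tuple[str, str, str]]:
    """Package extracted values y_n as (sample_id, attribute, value) triples;
    open-vocabulary attributes may yield one or several values."""
    return [(sample_id, attribute, v) for v in values]
```

For a multi-value attribute such as Color, a single sample can thus yield several triples, one per extracted value.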

Motivating Analysis on the Textual Bias of Attribute Information Extraction
Existing textual-based models, and multimodal models directly trained with weak labels, suffer from a strong bias toward the texts. As illustrated in Figure 1, the training label for the color attribute of the mattress is misled by 'green tea infused' in the textual profile. Models trained with such textual-shifted labels exhibit a learning ability gap between modalities, where the model learns better from the textual than from the visual modality.
To quantitatively study the learning bias, we conduct fine-grained source-aware evaluations on a real-world e-Commerce dataset with representative unimodal and multimodal methods, namely OpenTag (Zheng et al., 2018) with the classification setup and PAM (Lin et al., 2021). Specifically, for each sample in the test set, we collect the source of the gold value (i.e., text or image). Experimental results are shown in Figure 2, where the label Source: Text indicates the gold value is present in the text, while Source: Image indicates the gold value is absent from the text and must be inferred from the image. Both the text-based unimodal extractor and the multimodal extractor achieve impressive results when the gold value is contained in the text. However, when the gold value is not contained in the text and must be derived from the visual input, performance drops dramatically on all three metrics, indicating a strong textual bias and text dependence of existing models.
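The bucketing behind this source-aware protocol can be sketched as follows; this is a simplified single-value version with hypothetical field names (the full protocol additionally normalizes synonyms and macro-aggregates across values):

```python
from collections import defaultdict

def source_aware_scores(samples):
    """samples: iterable of dicts with keys 'source' ('text' or 'image'),
    'gold' (gold value string), and 'pred' (extracted value or None).
    Returns per-source precision/recall/F1."""
    buckets = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for s in samples:
        b = buckets[s["source"]]
        if s["pred"] is None:
            b["fn"] += 1              # no value extracted: missed gold
        elif s["pred"] == s["gold"]:
            b["tp"] += 1              # exact match
        else:
            b["fp"] += 1              # wrong value extracted...
            b["fn"] += 1              # ...and the gold value was missed
    scores = {}
    for src, b in buckets.items():
        p = b["tp"] / (b["tp"] + b["fp"]) if b["tp"] + b["fp"] else 0.0
        r = b["tp"] / (b["tp"] + b["fn"]) if b["tp"] + b["fn"] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[src] = {"precision": p, "recall": r, "f1": f1}
    return scores
```

Comparing the two buckets then exposes exactly the modality gap discussed above: a text-biased extractor scores well on the 'text' bucket and collapses on the 'image' bucket.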

PV2TEA
We present the backbone architecture and the three bias reduction designs of PV2TEA, shown in Figure 3. The backbone is formulated as visual question answering (VQA) and is composed of three modules. (1) Visual Encoding. We adopt the Vision Transformer (ViT) (Dosovitskiy et al., 2021) as the visual encoder. The given image I_n is divided into patches and featured as a sequence of tokens, with a special token [CLS-I] appended at the head of the sequence, whose representation v_n^cls stands for the whole input image I_n.
(2) Cross-Modality Fusion and Grounding. Following the VQA paradigm, we define the question prompt as "What is the R of the c_n?", with a special token [CLS-Q] appended at the beginning. A unimodal BERT (Devlin et al., 2019) encoder is adopted to produce token-wise textual representations from the sample profiles (title, bullets, and descriptions). The visual representations of the P image patches, v_n = [v_n^cls, v_n^1, . . ., v_n^P], are concatenated with the textual representations of the T tokens, t_n = [t_n^cls, t_n^1, . . ., t_n^T], which are further used to perform cross-modality fusion and grounding with the question prompt through cross-attention. The output q_n = [q_n^cls, q_n^1, . . ., q_n^Q] is then used as the grounded representation for the answer decoder.
(3) Attribute Value Generation. We follow the design from Li et al. (2022a), where each block of the decoder is composed of a causal self-attention layer, a cross-attention layer, and a feed-forward network. The decoder takes the grounded multimodal representation as input and predicts the attribute value ŷ_n in a generative manner. Training Objectives. The overall training objective of PV2TEA is formulated as

L = L_sc + L_ct + L_r-mlm,

where the three loss terms, namely the augmented label-smoothed contrastive loss L_sc (Section 3.1), the category-aware ViT loss L_ct (Section 3.2), and the neighborhood-regularized masked language modeling loss L_r-mlm (Section 3.3), correspond to the three aforementioned modules respectively.

Augmented Label-Smoothed Contrast for Multi-modality Loose Alignment (S1)

Contrastive objectives have been proven effective in multimodal pre-training (Radford et al., 2021) by minimizing the representation distance between different modalities of the same data point while pushing those of different samples apart (Yu et al., 2022). However, for attribute value extraction, the image and textual descriptions are typically loosely aligned from two perspectives: (1) Intra-sample weak alignment: The text description does not necessarily form a coherent and complete sentence, but rather a set of semantic fragments describing multiple facets. Thus, grounding the language to the corresponding visual regions is difficult.
(2) Potential inter-samples alignment: Due to the commonality of samples, the textual description of one sample may also correspond to the image of another.Thus, traditional binary matching and contrastive objectives become suboptimal for these loosely-aligned texts and images.
To handle the looseness of images and texts, we augment the contrast to include sample comparisons outside the batch, using two queues that store the most recent M (M ≫ batch size B) visual and textual representations, inspired by the momentum contrast in MoCo (He et al., 2020) and ALBEF (Li et al., 2021). For the intra-sample weak alignment of each given sample X_n, instead of using the one-hot pairing label p_n^i2t, we smooth the pairing target with the pseudo-similarity q_n^i2t:

p̃_n^i2t = (1 − α) p_n^i2t + α q_n^i2t,    (2)

where α is a hyper-parameter and q_n^i2t is calculated by a softmax over the multiplication of the [CLS] token representations, v'_n^cls and t'_n^cls, from the momentum unimodal encoders F'_v and F'_t. For potential inter-sample pairing relations, the visual representation v'_n^cls is compared with all textual representations T' in the queue to augment the contrastive loss. Formally, the predicted image-to-text matching probability of X_n against the m-th queued text is

d_n^i2t(m) = exp(sim(v_n^cls, t'_m^cls) / τ) / Σ_{m'=1}^{M} exp(sim(v_n^cls, t'_m'^cls) / τ).

With the smoothed targets from Equation (2), the image-to-text contrastive loss L_i2t is calculated as the cross-entropy between the smoothed targets p̃_n^i2t and the contrast-augmented predictions d_n^i2t, and vice versa for the text-to-image contrastive loss L_t2i. Finally, the augmented label-smoothed contrastive loss L_sc is the average of these two terms:

L_sc = (L_i2t + L_t2i) / 2.
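A minimal numpy sketch of the image-to-text side of this loss, under our own simplifications (similarities are plain dot products, the queue is a fixed array rather than a momentum-updated structure, and all names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def smoothed_i2t_loss(v_cls, t_queue, pos_idx, v_mom, t_queue_mom,
                      alpha=0.4, tau=0.07):
    """v_cls: (B, d) image [CLS] embeddings; t_queue: (M, d) queued text embeddings.
    pos_idx: (B,) index of each image's own text in the queue.
    v_mom / t_queue_mom: momentum-encoder counterparts used for pseudo-targets."""
    M = t_queue.shape[0]
    # predicted image-to-text matching distribution over the whole queue
    d_i2t = softmax(v_cls @ t_queue.T / tau)
    # pseudo-similarity from the momentum encoders
    q_i2t = softmax(v_mom @ t_queue_mom.T / tau)
    # one-hot pairing label, smoothed by the pseudo-similarity
    p_i2t = np.eye(M)[pos_idx]
    p_tilde = (1 - alpha) * p_i2t + alpha * q_i2t
    # cross-entropy between smoothed targets and augmented predictions
    return float(-(p_tilde * np.log(d_i2t + 1e-12)).sum(axis=1).mean())
```

Setting alpha to 0 recovers the ordinary contrastive cross-entropy against the one-hot pairing label; the text-to-image direction is symmetric with the roles of the two modalities swapped.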

Visual Attention Pruning (S2)
Images usually contain not only the visual foreground of the concerned category but also rich background context. Although previous studies indicate that context can serve as an effective cue for visual understanding (Doersch et al., 2015; Zhang et al., 2020; Xiao et al., 2021), it has been found that the output of ViT is often based on supportive signals in the background rather than the actual object (Chefer et al., 2022). Especially in a fine-grained task such as attribute value extraction, the associated backgrounds can distract the visual model and harm prediction precision. For example, when predicting the color of birthday balloons, commonly co-occurring context such as flowers could mislead the model and result in wrongly predicted values.
To encourage the ViT encoder F to focus on task-relevant foregrounds in the input image I_n, we add a category-aware attention-pruning schema supervised with category classification:

L_ct = CrossEntropy(ĉ_n, c_n),

where ĉ_n is the category predicted from the encoded image representation. In real-world information extraction tasks, 'category' denotes a classification schema for organizing and structuring diverse data, exemplified by the broad range of product types in e-Commerce, such as electronics, clothing, or books. These categories not only display vast diversity but also have distinct data distributions and properties, adding layers of complexity to information extraction scenarios.
The learned attention mask M in ViT can gradually come to resemble the object boundary of the category of interest and distinguish the most important task-related regions from backgrounds by assigning different attention weights to the image patches (Selvaraju et al., 2017). The learned M is then applied to the visual representation sequence v_n of the whole image to screen out noisy background and task-irrelevant patches before concatenation with the textual representation t_n for further cross-modal grounding.
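As an illustrative simplification (ours, not the paper's exact formulation: the learned mask M is a soft attention reweighting, whereas this sketch uses a hard top-k cut), the background screening could look like:

```python
import numpy as np

def prune_patches(patch_feats, attn_weights, keep_ratio=0.5):
    """patch_feats: (P, d) patch representations (excluding [CLS-I]);
    attn_weights: (P,) learned attention mask values over patches.
    Keeps the highest-attention patches and zeroes out the rest,
    mimicking the screening of background/task-irrelevant patches."""
    P = patch_feats.shape[0]
    k = max(1, int(P * keep_ratio))
    keep = np.argsort(attn_weights)[::-1][:k]   # indices of top-k foreground patches
    mask = np.zeros(P)
    mask[keep] = 1.0
    return patch_feats * mask[:, None], mask
```

The surviving patch representations would then be concatenated with the textual tokens for cross-modal grounding, so the decoder never attends to the screened-out background.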

Two-level Neighborhood-regularized Sample Weight Adjustment (S3)
Weak labels from established models can be noisy and biased toward the textual input. Directly training models with these labels leads to a learning gap across modalities. Prior work on self-training shows that embedding similarity can help mitigate label errors (Xu et al., 2023; Lang et al., 2022). Inspired by this line of work, we design a two-level neighborhood-regularized sample weight adjustment. In each iteration, the sample weight s(X_n) is updated based on its label reliability and then applied to the training objective of attribute value generation in the next iteration:

L_r-mlm = (1/N) Σ_n s(X_n) · g(y_n, ŷ_n),

where g measures the element-wise cross-entropy between the training label y_n and the prediction ŷ_n.
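The first-level (visual) reliability that feeds s(X_n) can be sketched with a brute-force KNN in numpy; the names and the Euclidean distance are our assumptions, and the second level is analogous with previous-iteration predictions in place of the training labels:

```python
import numpy as np

def visual_reliability(feats, labels, K=3):
    """feats: (N, d) visual representations; labels: (N,) weak training labels.
    Reliability of each sample = fraction of its K nearest visual neighbors
    that share its training label (first-level neighborhood regularization)."""
    N = feats.shape[0]
    # pairwise Euclidean distances, with self-distance excluded
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    rel = np.empty(N)
    for n in range(N):
        nbrs = np.argsort(dists[n])[:K]          # K nearest neighbors of sample n
        rel[n] = np.mean(labels[nbrs] == labels[n])
    return rel
```

A sample whose weak label disagrees with its visual neighborhood (e.g., the 'green' mattress) receives a low reliability and thus a low weight in the next training iteration.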
As illustrated by the example on the right of Figure 3, where green arrows point to samples with the same training label as y_n, and red arrows point to either visual or prediction neighbors, a higher consistency between the two sets indicates a higher reliability of y_n, formally explained below. (1) Visual Neighbor Regularization. The first level of regularization is based on the consistency between the set of samples with the same training label y_n and the visual feature neighbors of X_n. For each sample X_n with visual representation v_n, we adopt the K-nearest neighbors (KNN) algorithm to find its neighbor samples in the visual feature space:

N_n = KNN(v_n, D, K),

where KNN(v_n, D, K) denotes the K samples in D with visual representations nearest to v_n. Simultaneously, we obtain the set of samples in D whose training label y_j equals that of the sample X_n:

L_n = {X_j ∈ D : y_j = y_n}.

The reliability of sample X_n based on the visual neighborhood regularization is

s_v(X_n) = |N_n ∩ L_n| / K.

(2) Prediction Neighbor Regularization. The second level of regularization is based on the consistency between the set of samples with the same training label and the prediction neighbors from the previous iteration, which reflect the learned multimodal representation. Prediction regularization is added only after E epochs, when the model can give relatively confident predictions, ensuring the predicted values are qualified for correcting potential noise. Formally, we obtain the set of samples in D whose predicted attribute value p_j from the last iteration equals the training label of the sample X_n:

Y_n = {X_j ∈ D : p_j = y_n}.

With the truth-value consensus set Y_n, the reliability based on previous-prediction neighbor regularization of the sample X_n is

s_p(X_n) = |N_n ∩ Y_n| / K.

Overall, s(X_n) is regularized with visual neighbors from the start, and jointly with prediction neighbors after E epochs once the model predicts credibly.

Experiments

Human annotators verify the gold labels on the benchmark dataset to ensure preciseness. Besides, the label sources are marked down, indicating whether the attribute value is present or absent in the text, to facilitate fine-grained source-aware evaluation. The human-annotated benchmark datasets will be released to encourage the future development of modality-balanced multimodal extraction models. See Appendix A for the implementation and computation details of PV2TEA.

Evaluation Protocol
We use Precision, Recall, and F1 score based on synonym-normalized exact string matching. For the single-value type, an extracted value ŷ_n is considered correct when it exactly matches the gold value string y_n. For the multiple-value type, where the gold values for the query attribute R can contain multiple answers y_n ∈ {y_n^1, . . ., y_n^m}, the extraction is considered correct when all the gold values are matched in the prediction. Macro-aggregation is performed across attribute values to avoid the influence of class imbalance. All reported results are the average of three runs under the best settings.
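A compact sketch of this matching rule (the helper names are hypothetical, and the real synonym table is dataset-specific):

```python
def normalize(value, synonyms):
    """Map a value string to its canonical form via a synonym table."""
    v = value.strip().lower()
    return synonyms.get(v, v)

def is_correct(pred_values, gold_values, synonyms=None):
    """Multi-value rule from the protocol: an extraction counts as correct
    only when every gold value is matched in the prediction (after
    synonym normalization). Single-value is the special case of one gold."""
    synonyms = synonyms or {}
    pred = {normalize(p, synonyms) for p in pred_values}
    gold = {normalize(g, synonyms) for g in gold_values}
    return gold <= pred   # every gold value must appear among the predictions
```

Per-value precision and recall are then macro-averaged across attribute values, so rare values weigh the same as frequent ones.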

Baselines
We compare our proposed model with a series of baselines, spanning unimodal and multimodal methods. For unimodal baselines, we consider OpenTag (Zheng et al., 2018) and a unimodal generative model with the same architecture as PV2TEA but without the image patching, which is included to demonstrate the influence of the generation setting. For multimodal baselines, we consider discriminative encoder models, including ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019) with dual encoders, and UNITER (Chen et al., 2020) with a joint encoder. We also add generative encoder-decoder models for comparison. BLIP (Li et al., 2022a) adopts dual encoders and an image-grounded text decoder. PAM (Lin et al., 2021) uses a shared encoder and decoder separated by a prefix causal mask.
Experimental Results

Overall Comparison. Multimodal models generally achieve much higher recall than text-only ones, especially on the multi-value attribute Color, where multimodal models can even double the recall of text-only ones. However, the lower precision of the multimodal models implies the challenges beneath cross-modality integration. With the three proposed bias-reduction schemes, PV2TEA improves on all three metrics over the multimodal baselines and balances precision and recall to a great extent compared with the unimodal models. Besides the full PV2TEA, we also include three variants that remove one proposed schema at a time. The results show that the visual attention pruning module mainly helps precision, while the other two benefit both precision and recall, leading to the best F1 performance when all three schemes are equipped. We include several case studies in Section 5.3 for qualitative observation.

Source-Aware Evaluation. To investigate how the modality learning bias is addressed, we conduct a fine-grained source-aware evaluation similar to Section 2.2, as shown in Table 3. The performance gap between cases where the gold value is present in the text and where it is absent is significantly reduced by PV2TEA compared to both the unimodal and multimodal representative methods, which suggests a more balanced and generalized capacity of PV2TEA to learn from different modalities. When the gold value is absent from the text, our method achieves more than twice the recall of OpenTag_cls, and also outperforms the multimodal PAM on precision under various scenarios.

Ablation Studies
Augmented Label-Smoothed Contrast. We examine the impact of the label-smoothed contrast on both single- and multiple-value type datasets.

Table 4: Ablation study on the augmented label-smoothed contrast for cross-modality alignment (%).
Figure 4: The influence study of alignment objectives, i.e., binary matching vs. contrastive loss, and of the softness α, via the tasks of image-to-text and text-to-image retrieval. The metric T/I@1 is the recall of text/image retrieval at rank 1, T/I@M is the rank average, and R@Mean further averages T@M and I@M.
Table 4 shows that removing the contrastive objective leads to a drop in both precision and recall. For the multiple-value dataset, adding the contrastive objective significantly benefits precision, suggesting it encourages cross-modal validation when there are multiple valid answers in the visual input. With label smoothing, the recall can be further improved. This indicates that the augmented and smoothed contrast can effectively leverage the inter-sample cross-modality alignment, hence improving the coverage rate when making predictions.
In addition, we conduct cross-modality retrieval to study the efficacy of the aligning objectives, i.e., binary matching and contrastive loss, for cross-modality alignment, as well as the influence of the softness α, as shown in Figure 4. Across different datasets and metrics, the contrastive loss consistently outperforms the binary matching loss. This consolidates our choice of the contrastive objective and highlights the potential benefits of label smoothing and contrast augmentation, given that both are neglected in a binary matching objective. Retrieval performance under different smoothness values shows a trend of first rising and then falling. We simply take α = 0.4 in our experiments.

Category-Aware Attention Pruning. We study the influence of the category-aware attention pruning, as shown in Table 5. The results imply that adding the category classification helps to improve precision without harming recall, and that the learned attention mask can effectively highlight the foreground regions of the queried sample.

Generation vs. Classification. We also compare the generative setting against a classification-based information extractor. The results are demonstrated in Table 7. The generation setting achieves significant advantages over classification, especially on the recall performance for the multi-value attribute Color, where the gold value can contain multiple answers and the relative improvement in recall reaches up to 20%. This indicates that the generation setting can extract more complete results from the multimodal input, leading to a higher coverage rate. Therefore, we choose the generation setting for the attribute value extraction module in the final architecture design of PV2TEA. To qualitatively observe the extraction performance, we attach several case studies in Figure 6. Even when the attribute value is not contained in the text, PV2TEA can still perform the extraction reliably from images. On multiple-value datasets such as Color, PV2TEA can effectively differentiate related regions and extract multiple values with comprehensive coverage.

Related Work
Attribute Information Extraction. Attribute extraction has been extensively studied in the literature, primarily based on textual input. OpenTag (Zheng et al., 2018) formalizes it as a sequence tagging task and proposes a combined model leveraging bi-LSTM, CRF, and attention to perform end-to-end tagging. Xu et al. (2019) scale the sequence-tagging-based model with a global set of BIO tags. AVEQA (Wang et al., 2020) develops a question-answering model by treating each attribute as a question and extracting the best answer span from the text. TXtract (Karamanolakis et al., 2020) uses a hierarchical taxonomy of categories and improves value extraction through multitask learning. AdaTag (Yan et al., 2021) exploits an adaptive CRF-based decoder to handle multi-attribute value extraction. Additionally, there have been a few attempts at multimodal attribute value extraction. M-JAVE (Zhu et al., 2020) introduces a gated attention layer to combine information from the image and text. PAM (Lin et al., 2021) proposes a transformer-based sequence-to-sequence generation model for multimodal attribute value extraction. Although the latter two use both visual and textual input, they fail to account for possible modality bias and are fully supervised.

Multi-modality Alignment and Fusion. The goal of multimodal learning is to process and relate information from diverse modalities. CLIP (Radford et al., 2021) makes a gigantic leap forward in bridging the embedding spaces of image and text with contrastive language-image pretraining. ALBEF (Li et al., 2021) applies a contrastive loss to align the image and text representations before merging them with cross-modal attention, which fits loosely-aligned sample images and texts. Using noisy image alt-text data, ALIGN (Jia et al., 2021) jointly learns representations applicable to either vision-only or vision-language tasks. The Vision-Language Pre-training (VLP) framework established by BLIP (Li et al., 2022a) is flexibly applied to both vision-language understanding and generation tasks. GLIP (Li et al., 2022b) offers a grounded language-image paradigm for learning semantically rich visual representations. FLAVA (Singh et al., 2022) creates a foundational alignment that simultaneously addresses vision, language, and their interconnected multimodality. Flamingo (Alayrac et al., 2022) equips the model with in-context few-shot learning capabilities. SimVLM (Wang et al., 2022b) is trained end-to-end with a single prefix language modeling objective and investigates large-scale weak supervision. Multi-way Transformers are introduced in BEiT-3 (Wang et al., 2022a) for generic modeling and modality-specific encoding.

Conclusion
In this work, we propose PV2TEA, a bias-mitigated visual modality patching-up model for multimodal information extraction, taking attribute value extraction as an illustrative example. Results on our released source-aware benchmarks demonstrate remarkable improvements: the augmented label-smoothed contrast promotes a more accurate and complete alignment for loosely related images and texts; the visual attention pruning improves precision by masking out task-irrelevant regions; and the neighborhood-regularized sample weight adjustment reduces textual bias by lowering the influence of noisy samples. We anticipate the investigated challenges and proposed solutions will inspire future scenarios where a task is first established on text and then expanded to multiple modalities.

Limitations
There are several limitations that can be considered for future improvements: (1) In multimodal alignment and fusion, we only consider a single image for each sample, whereas multiple images can be available. A more flexible visual encoding architecture that can digest an indefinite number of input images could improve visual information coverage. (2) The empirical results in this work focus on three attribute extraction datasets (i.e., Item Form, Color, and Pattern) that clearly benefit from the visual perspective, while various attribute types rely more on the textual input. Different traits of attributes may influence the preferred modalities during modeling, which is out of scope for this work but serves as a natural extension of this study. (3) Currently there is no specific design to improve efficiency on top of the visual question answering architecture, which may not scale as the number of attributes increases.
The attention-pruning mechanism poses a potential dual-use risk that could harm the results. It encourages the model to focus on the task-relevant foreground of the given image, selected with category supervision, which can improve prediction precision when the input image is visually rich and contains a noisy contextual background. However, for some types of images, such as infographics, there may be helpful text on the images, possibly attached intentionally by providers. These additional texts may be overlooked by the attention-pruning mechanism, resulting in potential information loss. A possible mitigation strategy is to add an OCR component alongside the visual encoder to extract potential text information from the given images.

Ethics Statement
We believe this work has a broader impact beyond the task and datasets in the discussion. The textual bias problem studied in our motivating analysis, and the potential of training a multimodal model with weakly supervised labels from text-established models, are not restricted to a specific task. It has also become common in the NLP domain that tasks first established on pure text input are later expected to incorporate multimodal input. The discussion in this work can therefore be generalized to many other application scenarios. The proposed solutions for multimodal integration and modality bias mitigation are independent of the model architecture, and we expect they can be applied to other downstream tasks or inspire designs with similar needs.
Regarding the human annotation involved in this work, we create three benchmark datasets that are manually labeled by human annotators to facilitate the source-aware evaluation. The annotation includes both the gold attribute values and the label sources, i.e., image or text. The profiles and images are all collected from the publicly accessible Amazon shopping website. We rely on internal quality-assured annotators with balanced demographic and geographic characteristics, who consent and are paid adequately, based in the US. The data collection protocol is approved by the ethics review board. We attach the detailed human annotation instructions and usage explanations provided to the annotators in Appendix F for reference.

A Implementation Details
Our models are implemented with PyTorch (Paszke et al., 2019) and the Huggingface Transformers library and trained on an 8-GPU Tesla V100 node. The model is trained for 10 epochs, where the Item Form dataset takes around 12 hours, the Color dataset about 32 hours, and the Pattern dataset around 35 hours on a single GPU. The overall architecture of PV2TEA consists of 361M trainable parameters, where a ViT-base (Dosovitskiy et al., 2021) of 85M parameters, pre-trained on ImageNet, is used as the image encoder, and the text encoder is initialized from BERT-base (Devlin et al., 2019) of 123M parameters. We use AdamW (Loshchilov and Hutter, 2019) as the optimizer with a weight decay of 0.05. The learning rate of each parameter group is set using a cosine annealing schedule (Loshchilov and Hutter, 2016) with an initial value of 1e-5. Both the training and testing batch sizes are 8. The memory queue size M is set to 57600 and the temperature τ in Equation (4) is set to 0.07. We performed a grid search for the softness α over [0, 0.2, 0.4, 0.6, 0.8] and used the best-performing value of 0.4 for reporting the final results. The K for the two-level neighborhood regularization is set to 10. The input textual description is cropped to a maximum of 100 words. The input image is divided into 30 by 30 patches. The hidden dimension of both the visual and textual encoders is set to 768 to produce the representations of patches, tokens, and the whole image/sequence. The epoch E for adding the second-level prediction neighbor regularization to the reliability score s(X_n) is set to 2.
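For concreteness, a cosine annealing schedule with the reported initial value of 1e-5 can be written as below; the decay-to-zero floor and per-step granularity are our assumptions rather than details from the paper:

```python
import math

def cosine_lr(step, total_steps, lr_init=1e-5, lr_min=0.0):
    """Cosine annealing from lr_init down to lr_min over total_steps.
    lr_init matches the reported setting; lr_min=0 is an assumption."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_init - lr_min) * cos
```

In PyTorch this corresponds to wrapping the AdamW optimizer with a cosine annealing scheduler stepped once per iteration (or per epoch, depending on the chosen granularity).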
B Source-Aware Evaluation on the Color and Pattern Datasets
The source-aware evaluation of the Color and Pattern datasets is shown in Table 8. Similarly to the discussion in Section 5.1, compared with the baselines, the proposed PV2TEA effectively mitigates the F1 performance gap when the gold value is not contained in the text. More specifically, compared with the unimodal method, PV2TEA mainly reduces the cross-modality gap in recall, while compared with the multimodal method, the reduction happens mainly in precision; in each case this corresponds to the weaker metric of that method type. This indicates the stronger generalizability and more balanced learning ability of PV2TEA.
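The source-aware evaluation splits the test set by the annotated source label (0 = gold value in text, 1 = gold value only in image) and scores each split separately. The sketch below shows one way such per-source metrics could be computed; the record layout and the treatment of abstentions (`None` predictions) are illustrative assumptions, not the paper's exact evaluation code.

```python
def per_source_metrics(records):
    """Compute precision/recall/F1 separately per gold-value source.
    Each record is (predicted_value, gold_value, source_label), where
    source_label is 0 (text) or 1 (image) and a prediction of None
    means the model produced no value for the sample."""
    out = {}
    for src in (0, 1):
        subset = [r for r in records if r[2] == src]
        tp = sum(1 for p, g, _ in subset if p is not None and p == g)
        n_pred = sum(1 for p, _, _ in subset if p is not None)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / len(subset) if subset else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[src] = {"precision": precision, "recall": recall, "f1": f1}
    return out
```

Comparing the two sub-dictionaries directly exposes the cross-modality gap discussed above: a text-only model typically loses recall on the source-1 split, while a weaker multimodal model loses precision.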

C Ablation Studies on Pattern Dataset
We further include the ablation results on the single-value-type dataset Pattern for each proposed mechanism in Table 9, Table 10, and Table 11, respectively. The observations are largely consistent with the discussion in Section 5.2: all three proposed mechanisms improve the overall F1. Note that recall with attention pruning drops slightly compared with recall without it. This may indicate potential information loss on challenging datasets such as Pattern when only the selected foreground is kept. We discuss this risk in detail in the Limitations section.

D Retrieval Ablation on Pattern Dataset
Similar to Figure 4, we also show the cross-modality retrieval results on the Pattern dataset in Figure 8. The conclusion is consistent with the observations in Section 5.2: the contrastive objective shows advantages in cross-modal alignment and fusion, and performance peaks at a softness of 0.4.
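The softness α studied here controls how far the contrastive targets move away from strict one-hot image-text pairing (scheme S1). The sketch below illustrates one common way to build such label-smoothed targets, mixing the one-hot positive with a softmax over similarity scores; the exact form of Equation 4 in the paper may differ, and the assumption that `sims` comes from a momentum encoder is ours.

```python
import math

def smoothed_contrastive_targets(pos_index, sims, alpha=0.4, tau=0.07):
    """Label-smoothed targets for image-text contrastive learning.
    Mixes the one-hot target at pos_index with a temperature-scaled
    softmax over candidate similarities `sims`, so loosely paired
    candidates also receive some probability mass. alpha is the
    softness (0 recovers plain one-hot contrast); tau matches the
    temperature of 0.07 used in the paper."""
    exp_scores = [math.exp(s / tau) for s in sims]
    z = sum(exp_scores)
    soft = [e / z for e in exp_scores]
    return [
        (1.0 - alpha) * (1.0 if i == pos_index else 0.0) + alpha * soft[i]
        for i in range(len(sims))
    ]
```

At α = 0 this reduces to the standard InfoNCE target; the grid search over [0, 0.2, 0.4, 0.6, 0.8] reported in Appendix A found 0.4 to work best, consistent with the retrieval curves in Figure 8.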

E Visualizations of Attention Pruning
Examples of the learned attention masks are visualized in Figure 7. The visual foreground is highlighted under the supervision of category classification, which encourages higher prediction precision for fine-grained tasks like attribute extraction, as supported by the experimental results.
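As a rough intuition for the pruning step (scheme S2), one can think of it as keeping only the highest-scoring patches and renormalizing. The sketch below is a simplified thresholding illustration; in PV2TEA the mask is learned and supervised by category classification rather than fixed by a ratio, and `keep_ratio` is a hypothetical parameter of this sketch only.

```python
def prune_patch_attention(attn_weights, keep_ratio=0.5):
    """Keep only the patches with the highest attention scores
    (treated as the visual foreground), zero out the rest, and
    renormalize so the kept weights sum to 1. Ties at the threshold
    may keep a few extra patches."""
    k = max(1, int(len(attn_weights) * keep_ratio))
    threshold = sorted(attn_weights, reverse=True)[k - 1]
    pruned = [w if w >= threshold else 0.0 for w in attn_weights]
    z = sum(pruned)
    return [w / z for w in pruned]
```

This also makes the recall risk discussed in Appendix C concrete: any attribute evidence carried by a pruned (background) patch is lost, which is why recall can dip slightly on hard datasets like Pattern.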

F Human Annotation Instruction
We create source-aware fine-grained datasets with internal human annotators. Below are the instruction texts provided to annotators: The annotated attribute values are used for research model development of multimodal attribute information extraction and fine-grained error analysis. The datasets are named source-aware multimodal attribute extraction evaluation benchmarks and will be released to facilitate public testing and future studies in bias-reduced multimodal attribute value extraction model designs. All the given sample profiles (title, bullets, and descriptions) and images are collected from public amazon.com web pages, so there is no potential legal or ethical risk for annotators. Specifically, the annotation comprises two tasks, in order: (1) First, for each given sample_id in the given ASINs set, determine the category of the sample by referring to the ID2Category.csv mapping file, then label the gold value for the queried attribute by selecting from the candidates given the category. The annotation answer candidates for the Item Form dataset are listed in Table 12. Note that this gold value annotation step requires reference to the sample's textual titles, descriptions, and images. (2) For each annotated ASIN, mark down which modality implies the gold value with an additional source label, with the following meanings:
• 0: the gold attribute value can be found in the text.
• 1: the gold attribute value cannot be inferred from the text but can be found in the image.
The annotated attribute values and source labels are assembled into the fine-grained source-aware evaluation.

[Figure 1 example: a product image paired with the textual description "Best Price Mattress 12 Inch Memory Foam Mattress, Calming Green Tea-Infused Foam, Pressure Relieving, Bed-in-a-Box, Queen". Question: What is the color of the mattress? Weakly supervised label: green; true value: white. Challenge explanations: (C1) loosely aligned image and textual descriptions — intra-sample: weakly related across modalities and difficult to ground; inter-sample: images of other samples can also pair with this text. (C2) Visual bias: noisy contextual backgrounds, e.g., pillow, bed frame. (C3) Textual bias: the training label is misled/biased by "green tea" in the text.]

Figure 1: Illustration of multimodal attribute extraction and the challenges in cross-modality integration.

Figure 2: Source-aware evaluation of existing unimodal and multimodal models on the textual-bias issue.

Figure 3: Overview of the PV2TEA model architecture with three modules, each equipped with a bias reduction scheme corresponding to the challenges discussed in Figure 1.

Figure 5: Visualization of the learned attention mask with category-aware (e.g., product type) ViT classification.

Figure 5 presents several visualizations of the learned attention mask.

Neighborhood Regularization. We consider the influence of the two-level neighborhood regularization by removing the visual neighborhood regularization (Vis-NR), the prediction neighborhood regularization (Pred-NR), or both (NR) from the full model.

Figure 7: Visualization examples of the learned category-aware attention pruning mask.

Figure 8: Influence study of alignment objectives (binary matching vs. contrastive) and of the softness α, via cross-modality retrieval on the Pattern dataset.


Table 1: Statistics of the attribute extraction datasets.
³ See Appendix G for additional demo examples.

Table 2: Performance comparison with different baselines (%). The performance gains over the baselines have passed the t-test with p-value < 0.05. The best performance is in bold, and the runner-up baseline is underlined.
We use values mined from product pages (Zalmout and Li, 2022) as weak training labels instead of highly processed data, and follow the same filtering strategy from this prior text-established work to denoise the training data. For testing, we manually annotate gold values.

OpenTag (Zheng et al., 2018) is considered a strong text-based model for attribute extraction. OpenTag_seq formulates the task as sequence tagging and uses a BiLSTM-CRF architecture with self-attention. OpenTag_cls replaces the BiLSTM encoder with a transformer encoder and tackles the task as classification. TEA is another text-only unimodal baseline.

Table 3: Fine-grained source-aware evaluation of different methods. The gold value source indicates whether the gold value is contained in the text, or is not contained in the text and must be inferred from the image.

Table 2 shows the performance comparison of different types of extraction methods. PV2TEA achieves the best F1 performance, especially compared to unimodal baselines, demonstrating the advantage of patching the visual modality to this text-established task. Comparing unimodal methods with multimodal ones, text-only models achieve impressive precision while suffering from low recall, indicating potential information loss when the gold value is not contained in the input text. With its generative setting, TEA partially mitigates this information loss and improves recall over OpenTag under the tagging and classification settings. Besides, adding visual information can further improve recall, especially for the multi-value attribute

Table 6: Ablation study on the two-level neighborhood-regularized sample weight adjustment (%).

Classification vs. Generation. To determine which architecture is better for multimodal attribute value extraction, we compare the generation and classification settings for the attribute prediction module.

Results in Table 6 show that all metrics decrease when both regularizations are removed, indicating the validity of the proposed neighborhood-regularized sample weight adjustment in mitigating the influence of hard, noisy samples. Besides, since the second-level prediction-based neighbor regularization is independent of the multimodal extraction framework, it can be flexibly incorporated into other frameworks for future use.

Table 7: Attribute extraction performance comparison between the classification and generation settings.

Table 8: Fine-grained source-aware evaluation for the Color and Pattern datasets.

Table 10: Ablation study on the category-supervised visual attention pruning (%).

Table 12: The annotation candidates provided to annotators for each sample type on the Item Form dataset.