Visually-Enhanced Phrase Understanding

Large-scale vision-language pre-training has exhibited strong performance in various visual and textual understanding tasks. Recently, the textual encoders of multi-modal pre-trained models have been shown to generate high-quality textual representations, which often outperform purely text-based models such as BERT. In this study, our objective is to utilize both the textual and visual encoders of multi-modal pre-trained models to enhance language understanding tasks. We achieve this by generating an image associated with a textual prompt, thus enriching the representation of a phrase for downstream tasks. Results from experiments conducted on four benchmark datasets demonstrate that our proposed method, which leverages visually-enhanced text representations, significantly improves performance in the entity clustering task. (Source code: https://github.com/MiuLab/VisualLU)


Introduction
Recent advances in vision-language pre-training have successfully aligned visual and linguistic inputs through cross-modal pre-training objectives, such as language modeling and contrastive learning (Lu et al., 2019; Radford et al., 2021). These pre-trained models have shown impressive performance on downstream vision-language tasks, validating their cross-modal capabilities (Su et al., 2019).
While most previous studies focused on multi-modal tasks, researchers have shown that pre-trained cross-modal encoders are equally proficient at uni-modal language understanding, matching the performance of pre-trained text encoders. Lu et al. (2022) pioneered the use of machine abstract imagination from pre-trained cross-modal encoders, demonstrating improvement on general NLU tasks. Yan et al. (2022) established that the text encoder of CLIP (Radford et al., 2021) surpasses models designed for producing phrase representations, including Phrase-BERT (Wang et al., 2021) and UCTopic (Li et al., 2022a). They hypothesized that the visual supervision during pre-training empowers CLIP to produce visually-grounded phrase representations that are beneficial for language-only tasks. This phenomenon aligns with neuroscience studies demonstrating that visual and linguistic semantic representations are coordinated in the human brain (Popham et al., 2021).
Despite the strong performance of the previous method, it only utilized the text encoder of a cross-modal pre-trained model. In contrast, our study aims to exploit its full multi-modal representation capacity, incorporating both the text and image encoders. We introduce a visually-enhanced phrase understanding framework that exploits multiple modalities for uni-modal tasks. Our framework comprises a text-to-image generator and a text-image cross-modal encoder. We employ the text-to-image generator to produce visual cues for a textual candidate. Subsequently, the generated image and the textual prompt are processed by the cross-modal encoder to create visually-enhanced phrase embeddings. Unlike Lu et al. (2022), our method does not require supervised data for downstream tasks, making it more scalable. Our approach also differs from VOKEN (Tan and Bansal, 2020), which generated visual cues as tokens and processed the signal solely on the language side, whereas we employ representations directly from different modalities. Therefore, our model can capture more abstract concepts from images, enhancing generalizability.
We evaluate our approach on four benchmark phrase understanding datasets. The experiments show that our proposed visual enhancement significantly outperforms all text-only baselines, demonstrating that abstract visual concepts can provide complementary cues for text understanding.

Figure 1: Illustration of the proposed framework.

Method
Our proposed method is illustrated in Figure 1, where we first generate images associated with phrases using a text-to-image diffusion model. Following this, we utilize pre-trained text and image encoders to construct visually-enhanced phrase embeddings for downstream understanding tasks.

Text-To-Image Model
Recently, text-to-image models have attracted significant interest. Among these, diffusion models have played an important role in text-to-image generation, showing impressive performance. To more effectively generate visual cues associated with texts, this study adopts stable diffusion (Rombach et al., 2022) as our image generation model. During the training phase, an image autoencoder is trained using an extensive image database. A time-conditional U-Net (Long et al., 2015) forms the core of the diffusion model, learning to denoise image latent representations incrementally.
In the sampling procedure, we first obtain a text prompt and derive a text embedding from the text encoder. Subsequently, we use Gaussian noise as the latent representation, and progressively denoise the latent representation via the diffusion model and a scheduler algorithm. Ultimately, an image is generated by reconstructing the latent representation through the image decoder.
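To make the sampling procedure concrete, the following is a minimal sketch using the Hugging Face diffusers library; the checkpoint name and generation settings are assumptions rather than the paper's exact configuration (the appendix mentions stable diffusion v2-base).

```python
# Minimal text-to-image sampling sketch with diffusers (assumed checkpoint).
import torch
from diffusers import StableDiffusionPipeline

# Load the latent diffusion pipeline; it bundles the text encoder,
# the time-conditional U-Net, the scheduler, and the image decoder.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")

# The pipeline encodes the prompt, starts from Gaussian-noise latents,
# denoises them step by step with the scheduler, and decodes the result.
image = pipe("A photo of Mount Everest", num_inference_steps=30).images[0]
image.save("mount_everest.png")
```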

CLIP (Contrastive Language-Image Pretraining)
CLIP (Radford et al., 2021) is a large-scale vision-language pre-training model based on contrastive learning, which achieves remarkable performance in zero-shot image classification tasks. Given a batch of data $D$, CLIP jointly trains an image encoder and a text encoder to maximize the similarities of the $|D|$ paired text-image representations while minimizing the similarities of the remaining $(|D|^2 - |D|)$ unpaired text-image representations. Given the weak alignment between texts and images, this study employs the pre-trained CLIP text encoder $E_{\text{text}}$ and image encoder $E_{\text{image}}$ to extract meaningful cues from different modalities. Our experiments focus on showing that the pre-trained CLIP encoders provide superior visual enhancement for texts compared to separately pre-trained text and image encoders.
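As a minimal sketch of how the two CLIP encoders can be used for feature extraction (the checkpoint choice and the use of pooler_output, mentioned in Appendix B, are assumptions about implementation details):

```python
# Extracting text and image representations from pre-trained CLIP encoders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

text_in = proc(text=["A photo of Mount Everest"], return_tensors="pt", padding=True)
img_in = proc(images=Image.open("mount_everest.png"), return_tensors="pt")

with torch.no_grad():
    r_t = clip.text_model(**text_in).pooler_output    # E_text output
    r_i = clip.vision_model(**img_in).pooler_output   # E_image output
```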

Visually-Enhanced Multimodal Representation
Given a text sequence with an entity candidate phrase $p$, we design our text prompt as "A photo of <p>", a proven effective default template that delivers robust zero-shot classification performance (Radford et al., 2021). As depicted in Figure 1, we first use the text prompt to generate a text-associated image with the text-to-image model $G$. We then employ the pre-trained text and image encoders of CLIP to extract the corresponding representations as follows:

$$r_t(p) = E_{\text{text}}(\text{prompt}(p)), \qquad r_i(p) = E_{\text{image}}(G(\text{prompt}(p))),$$

where $\text{prompt}(p)$ denotes the template "A photo of <p>".
Lastly, we concatenate the two embeddings originating from different modalities to create visually-enhanced phrase embeddings, which potentially capture richer and more comprehensive information and thus benefit downstream tasks.
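Putting the pieces together, a minimal end-to-end sketch of constructing a visually-enhanced phrase embedding might look as follows; the function name and checkpoints are illustrative assumptions, not the authors' released code.

```python
# End-to-end sketch: prompt -> generated image -> concatenated embedding.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")  # text-to-image generator G (assumed checkpoint)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visually_enhanced_embedding(phrase: str) -> torch.Tensor:
    prompt = f"A photo of {phrase}"                          # default template
    image = pipe(prompt, num_inference_steps=30).images[0]   # visual cue from G

    text_in = proc(text=[prompt], return_tensors="pt", padding=True)
    img_in = proc(images=image, return_tensors="pt")
    with torch.no_grad():
        r_t = clip.text_model(**text_in).pooler_output    # r_t(p)
        r_i = clip.vision_model(**img_in).pooler_output   # r_i(p)
    # Concatenate the two modality representations into one phrase embedding.
    return torch.cat([r_t, r_i], dim=-1).squeeze(0)
```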

Experiments
To evaluate whether our visually-enhanced phrase embeddings provide improved semantic cues, we conduct a series of experiments on entity clustering, where the goal is to group entity candidates with similar concepts based solely on their phrase representations in an unsupervised fashion.

Setup
Our experiments are conducted on four diverse datasets, each with annotated entities from various domains:
• CoNLL2003 (Sang and De Meulder, 2003) comprises 20,744 sentences, incorporating four types of entities: persons (PER), organizations (ORG), locations (LOC), and miscellaneous names (MISC).
• BC5CDR (Li et al., 2016) is formed from 1,500 PubMed articles and contains chemical and disease entities.
• W-NUT 2017 (Derczynski et al., 2017) is collected from public platforms, including YouTube and Twitter, with a focus on identifying previously unseen entities in emerging discussions. It includes six types of entities.
• MIT-Movie (Liu et al., 2013) contains 12,218 sentences featuring title and person entities.
Following previous research (Xu et al., 2017; Li et al., 2022b; Yan et al., 2022), we apply K-means clustering to the cross-modal representations to perform unsupervised phrase understanding. In this setup, the number of clusters is set to the number of classes present in the dataset.
The Hungarian algorithm (Papadimitriou and Steiglitz, 1998) is employed to optimally allocate each cluster to a class.
To evaluate the quality of the representations and compare fairly with previous work, we employ accuracy (ACC) and normalized mutual information (NMI) as our evaluation metrics. The reported results are averages over five separate clustering runs. For our proposed image and text-image approaches, we run the diffusion model with three different seeds to generate images.
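A minimal sketch of this evaluation protocol (K-means clustering, Hungarian matching for ACC, and NMI) is shown below; variable names are illustrative.

```python
# Cluster phrase embeddings; score with ACC (via Hungarian matching) and NMI.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cluster_and_score(embeddings: np.ndarray, labels: np.ndarray, n_classes: int):
    preds = KMeans(n_clusters=n_classes, n_init=10).fit_predict(embeddings)

    # Co-occurrence matrix between clusters and gold classes; the Hungarian
    # algorithm finds the cluster-to-class assignment maximizing agreement.
    counts = np.zeros((n_classes, n_classes), dtype=np.int64)
    for p, y in zip(preds, labels):
        counts[p, y] += 1
    rows, cols = linear_sum_assignment(counts, maximize=True)
    mapping = dict(zip(rows, cols))

    acc = float(np.mean([mapping[p] == y for p, y in zip(preds, labels)]))
    nmi = normalized_mutual_info_score(labels, preds)
    return acc, nmi
```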

Baselines
We compare our model against various language models and phrase understanding models to validate the effectiveness of our cross-modal framework. The representations used are the same as described in the prior work: the average of the last Transformer layer as phrase-associated representations, the pooling of the entity-associated vectors based on https://github.com/JiachengLi1995/UCTopic for UCTopic, and the [EOT] token of the last Transformer layer's output as phrase representations for the CLIP text encoder.

Results
The evaluation results are presented in Table 1.
Our proposed visually-enhanced representations outperform all baselines on the CoNLL2003 and MIT-Movie datasets, while achieving competitive performance on the BC5CDR and W-NUT 2017 datasets. Moreover, solely utilizing image representations encoded from generated images yields a higher average ACC than all baselines, suggesting that the visual signal offers valuable cues for enhanced phrase understanding. Hence, we conclude that integrating different modalities can effectively augment phrase representations. For a more granular view, we provide detailed scores across multiple runs in Table 2; the lower standard deviation of our proposed text-image approach indicates superior stability.

Analysis of Different Encoders
To further investigate whether the jointly pre-trained CLIP encoders are more effective for visual enhancement, we compare them with image and text encoders that have been pre-trained individually. Table 3 presents the experimental results, where we substitute the text and image encoders of CLIP with RoBERTa-base and ViT-B/32, respectively. We observe that phrase representations augmented by ViT-B/32 outperform purely textual representations, which suggests the richness of information drawn from multiple modalities. Moreover, the CLIP encoders surpass the individually pre-trained encoders, implying that text and image encoders pre-trained jointly can more effectively enrich phrase representations by integrating text and image at the representation level.
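For illustration, the substitution experiment can be sketched as below with individually pre-trained encoders; the exact checkpoints are assumptions.

```python
# Encoding with separately pre-trained RoBERTa-base and ViT-B/32 encoders.
import torch
from PIL import Image
from transformers import RobertaModel, RobertaTokenizer, ViTImageProcessor, ViTModel

text_enc = RobertaModel.from_pretrained("roberta-base")
tok = RobertaTokenizer.from_pretrained("roberta-base")
img_enc = ViTModel.from_pretrained("google/vit-base-patch32-224-in21k")  # assumed ViT-B/32 checkpoint
img_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch32-224-in21k")

with torch.no_grad():
    r_t = text_enc(**tok("A photo of Golan", return_tensors="pt")).pooler_output
    r_i = img_enc(**img_proc(images=Image.open("golan.png"), return_tensors="pt")).pooler_output
# Same concatenation as with the CLIP encoders.
embedding = torch.cat([r_t, r_i], dim=-1)
```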

Contextual Prompt
Previous work (Yan et al., 2022) demonstrated that enriching phrase candidates with a large pre-trained language model can yield more domain-specific keywords for textual prompts. Specifically, given a phrase $p$, the prompt "$p$ is a [MASK]" is fed into a language model, which in turn returns the top $K$ predictions $\{m_1, m_2, \ldots, m_K\}$ for the [MASK] token. Subsequently, we formulate the contextual prompt as "A photo of $p$, a $m_1, m_2, \ldots, m_K$." In this paper, we set $K$ to 3 for the contextual prompts. Table 1 shows that the addition of such contextual prompts enhances the performance of text-only baselines.
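A minimal sketch of this contextual-prompt construction with a masked language model is given below; the choice of bert-base-uncased is an illustrative assumption.

```python
# Build a contextual prompt from a masked LM's top-K keyword predictions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def contextual_prompt(phrase: str, k: int = 3) -> str:
    preds = fill_mask(f"{phrase} is a [MASK].", top_k=k)
    keywords = [p["token_str"].strip() for p in preds]
    return f"A photo of {phrase}, a {', '.join(keywords)}."

# e.g., contextual_prompt("Golan") might yield something like
# "A photo of Golan, a village, city, town."
```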
We further probe whether a contextual prompt can boost our performance and present the results in Table 4. We observe that utilizing contextual prompts for text embeddings yields comparable performance, indicating that our visual cues already encompass the domain-specific signal. We hypothesize that generating images from contextual prompts may introduce more noise, making it difficult to encode effective visual representations for phrase understanding. Notably, our baseline setting already achieves significantly better performance than earlier work utilizing additional keywords, demonstrating the informativeness of our cross-modal representations.

Table 4: The utility of contextual prompts. Vanilla: "A photo of $p$."; Contextual: "A photo of $p$, a $m_1, m_2, m_3$." ($p$ is the entity and $m_1, m_2, m_3$ are the keywords of $p$.)

Qualitative Analysis
To further examine how our visual cues enhance text understanding, we present several generated images along with their understanding results in Figure 2. The previous approach, CLIP Text, incorrectly classifies "Mpumulanga" and "Golan" as PER (persons). With the visual cues generated by our model, shown in Figure 2(a-b), we correctly classify them as LOC (locations). The images generated by our model, displayed in Figure 2(c-f), further enrich the phrase representations and lead to a better understanding of the concepts, demonstrating the effectiveness of our multi-modal framework. However, there are cases where the generated image leads to incorrect categorization, as with "BAYERISCHE VEREINSBANK" in Figure 2(g): the image misled the categorization process, changing the cluster from the correct class (ORG, organization) to an incorrect one (LOC, location). Figure 2(h) displays an instance where the generated image does not provide useful visual information for an unusual entity, so the incorrect classification (group) persists. Therefore, there is still room for improvement in future work.

Conclusion
This work presents a multi-modal framework that leverages a text-to-image model to bridge the language and visual modalities for enhanced text comprehension. The model effectively transforms text inputs into coherent images, enriching phrase representations by merging outputs from different modalities. Experimental results show that our framework surpasses strong phrase understanding models across diverse domains.

Limitations
Due to the maximum input length constraint of both the CLIP text encoder and the text-to-image model, we are unable to process long texts. We are interested in exploring alternative prompt configurations to circumvent this limitation; our methodology is readily extendable to such settings, making this an intriguing direction for future study.

Ethics Statement
Our approach leverages a pre-trained text-to-image model to visually enhance representations. However, the text-to-image model may carry over biases and improper content from its training data. This necessitates additional analyses to safeguard against any undue influence of these biases on our method.

B Implementation Details
In our work, we use the Huggingface models to generate all the representations:
• BERT/RoBERTa: We take pooler_output as the representation, i.e., the classification token processed through a linear layer and an activation function; the linear layer weights are learned via next sentence prediction during pre-training.
• LUKE: entity_last_hidden_states is used as the representation, which is the last hidden state of the input entity.
• Phrase-BERT: Phrase representations can be acquired by calling model.encode().
• UCTopic: We obtain the phrase representations with the released source code.
• CLIP: pooler_output is taken as the representation for both the text encoder and the image encoder.
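For concreteness, the pooler_output extraction for BERT might look as follows (a sketch; the same pattern applies to RoBERTa and to CLIP's two encoders):

```python
# Extract pooler_output from BERT as the phrase representation.
import torch
from transformers import BertModel, BertTokenizer

bert = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("Mount Everest", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
# pooler_output: the classification token passed through a linear layer and
# a tanh activation, with weights learned via next sentence prediction.
phrase_repr = outputs.pooler_output  # shape: (1, 768)
```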

D Inference Details
We conduct our experiments on a single V100 GPU.
• Generation time of stable diffusion v2-base with respect to inference steps is detailed in Appendix E.
• Each clustering experiment takes no more than 10 minutes to run.

E Efficiency vs. Efficacy
Results over different inference steps of stable diffusion v2-base are shown in Table 5. Inference took 0.84 seconds per image with 10 steps, 2.02 seconds per image with 30 steps, and 3.24 seconds per image with 50 steps. The balance between efficiency and efficacy depends on the application.
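A minimal sketch for reproducing such a timing comparison is shown below; timings will vary with hardware, and the checkpoint is assumed.

```python
# Time image generation across different numbers of inference steps.
import time

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")

for steps in (10, 30, 50):
    start = time.time()
    _ = pipe("A photo of Golan", num_inference_steps=steps).images[0]
    print(f"{steps} steps: {time.time() - start:.2f} s per image")
```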