Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation Incorporating Gloss Information

Visual Word Sense Disambiguation (VWSD) is the task of finding the image that most accurately depicts the correct sense of a target word in a given context. Image-text matching models have often struggled to recognize polysemous words. This paper introduces an unsupervised VWSD approach that uses gloss information from an external lexical knowledge-base, in particular the sense definitions. Specifically, we employ Bayesian inference to incorporate the sense definitions when the sense of the answer is not provided. In addition, to ameliorate the out-of-vocabulary (OOV) issue, we propose context-aware definition generation with GPT-3. Experimental results show that VWSD performance increased significantly with our Bayesian inference-based approach. Moreover, our context-aware definition generation achieved a prominent performance improvement on OOV examples, outperforming the existing definition generation method.


Introduction
With the development of deep learning technology, research on multimodality, such as Visio-Linguistic Models (VLMs), has been actively conducted (Schneider and Biemann, 2022). In particular, state-of-the-art VLMs, such as image-text matching (ITM) models (Radford et al., 2021; Singh et al., 2022) and text-to-image generation models (Rombach et al., 2022; Seneviratne et al., 2022), are employed in many industrial projects, including image retrieval systems (Yuan and Lam, 2021) and AI-assisted image generators (Das and Varshney, 2022; Seneviratne et al., 2022).
Visual Word Sense Disambiguation (VWSD) is a multimodal task of natural language processing (NLP) and computer vision that selects the image corresponding to the intended meaning of the target word from a set of candidate images (Raganato et al., 2023). Figure 1 is an example of VWSD. For the ambiguous target word 'Angora', we can see that the answer image should change depending on the context. VWSD can play an important role in several downstream tasks, including image retrieval (Chen et al., 2015), action recognition (Gella et al., 2017), and visual question answering (Whitehead et al., 2020).

Figure 1: An example of VWSD from the SemEval-2023 Task 1 dataset (Raganato et al., 2023). Even if the target word ('Angora') is the same, different images should be selected according to the context.
Unsupervised VWSD can be formulated in the same way as the ITM task (Cao et al., 2022), that is, finding the image that best matches the given context. However, VWSD often requires more complex reasoning over both text and images than conventional ITM models provide. The example in Figure 2 demonstrates that CLIP (Radford et al., 2021), a state-of-the-art (SOTA) ITM model, fails to recognize the answer image for the given context. This limitation of VLMs, their failure to handle ambiguous words, was also reported in a study on an image generation model (Rassin et al., 2022).
To ameliorate this problem, we propose to disambiguate visual words with the assistance of the glossary of lexical knowledge-bases (LKBs), without any further training or data. Specifically, we utilize the sense definitions of an ambiguous word, which have been widely exploited in previous lexical semantic tasks (Raganato et al., 2017; Gella et al., 2017; Pilehvar and Camacho-Collados, 2019). Since the answer sense of the target word is not provided in the VWSD setting, we propose an approach derived from Bayesian inference, using pretrained ITM models. Moreover, to deal with out-of-vocabulary (OOV) words, for which the sense definitions of the target word cannot be found in LKBs, we suggest the concept of context-aware definition generation (CADG). The definitions of a target word are generated by a large language model, GPT-3 (Brown et al., 2020), as auxiliary information for VWSD.
Experiments were conducted on SemEval-2023 (SE23) Task 1-Visual-WSD (Raganato et al., 2023), a publicly available VWSD dataset. In the experiments, we utilized two pretrained SOTA ITM models: (1) CLIP (Radford et al., 2021) and (2) FLAVA (Singh et al., 2022). Experiments showed that our proposed approach significantly improved the performance of the baseline ITM models. In addition, we demonstrated that our concept of CADG not only significantly increases performance on OOV cases but is also more advantageous than the previous definition generation approach. Our experimental code is available at https://github.com/soon91jae/UVWSD.
The contributions of this paper can be summarized as follows:
• This paper introduces a new gloss-incorporated VWSD approach inspired by Bayesian inference.
• Experimental results show that our Bayesian inference-based approach boosted the unsupervised VWSD performance significantly without any additional training.
• Furthermore, we suggest the CADG method to address the OOV issue.

Figure 2: Illustrative concepts and an example input on a CLIP model (Radford et al., 2021).
Related Work

Word Sense Disambiguation

Our work is closely related to Word Sense Disambiguation (WSD), which automatically identifies the corresponding senses of ambiguous words (O et al., 2018). Early WSD research tried to employ diverse information in LKBs in an unsupervised manner, such as lexical similarity (Kilgarriff and Rosenzweig, 2000), knowledge-graph connectivity (Agirre et al., 2014; Kwon et al., 2021), and topic modeling (Chaplot and Salakhutdinov, 2018). After the emergence of pretrained language models (LMs) such as BERT (Devlin et al., 2019), LM-based transfer learning approaches have been actively studied (Huang et al., 2019; Barba et al., 2021b). In particular, gloss-enhanced WSD models that use sense definition and context together, with cross-encoder (Huang et al., 2019; Barba et al., 2021a) or bi-encoder (Blevins and Zettlemoyer, 2020) structures, not only outperform existing approaches but are also robust on few-shot examples. Wahle et al. (2021) suggest incorporating WordNet knowledge into LMs while pretraining them. Specifically, the authors utilize a multi-task learning method that trains LMs with both a masked language modeling loss and a WSD task loss.

Visual Verb Sense Disambiguation (VVSD) is another task relevant to VWSD. VVSD is a multimodal sense disambiguation task that selects the correct sense for a given pair of an ambiguous verb and an image (Gella et al., 2017). Gella et al. (2017) suggest an unsupervised VVSD approach that combines various visio-linguistic features (image representations, object labels, image caption features) and calculates the matching score between an image and a sense definition with a variant of the Lesk algorithm. Vascon et al. (2021) propose a semi-supervised VVSD method based on game-theoretic transduction for inference. Meanwhile, Gella et al. (2019) demonstrate that a VVSD model trained on a multilingual VVSD dataset not only benefits verb sense disambiguation but also boosts the performance of a downstream task, multimodal machine translation.

Figure 3: Illustrative concepts and an example input on our gloss-enhanced framework on a CLIP model. Note that, even though the image encoder and the text encoder are exactly the same as those in Figure 2, our approach correctly predicts the answer image, unlike the original CLIP model.
Our work is related to gloss-enhanced WSD models in that we use sense definition and context together. However, our study differs from previous WSD studies in that it tackles a multimodal task. It is also relevant to VVSD in terms of multimodal sense disambiguation. However, VVSD systems (Gella et al., 2016) are usually designed to analyze a small number of verbs, while the VWSD task contains many nouns and adjectives. Finally, our work tackles the new VWSD task, and we introduce a method for incorporating sense definitions into SOTA ITM models based on Bayesian inference, in which the sense definitions serve as a latent variable.

Definition Generation
Our CADG is related to the definition generation task introduced by Noraset et al. (2017), whose purpose is to generate a definition for a given word. Noraset et al. (2017) suggest utilizing recurrent neural network-based LMs (RNNLMs) with definitions collected from WordNet and the GNU Collaborative International Dictionary of English (GCIDE). Gadetsky et al. (2018) propose definition generation models that handle polysemous words with context and a soft-attention mechanism. Another line of work proposes to perform semantic decomposition of word meanings and model them with discrete latent variables to generate definitions. Malkin et al. (2021) show that a large language model (GPT-3) can generate definitions of neologisms without additional fine-tuning. The authors suggest generating neologisms with a long short-term memory (LSTM) network (Yu et al., 2019) and definitions of the neologisms with a large pretrained LM, GPT-3 (Brown et al., 2020). CADG is similar to Malkin et al. (2021) in that it generates definitions using GPT-3. However, CADG differs in that it takes the context into account when constructing prompts, and this study demonstrates that the definitions produced by CADG can be effectively used in a downstream task, rather than focusing solely on the definition generation task itself.

Task Definition on Unsupervised VWSD
We formulate unsupervised VWSD as a multiclass classification task (Aly, 2005), as shown in Eq. 1. Unlike the image retrieval task (Jing et al., 2005), which ranks the most relevant images for a given text or keyword, VWSD is designed to disambiguate a specific target word t in the given context c. Specifically, we define the task as finding the image v̂ with the highest posterior probability from a set of images V^t that consists of one answer image and several distractors for the target word:

v̂ = argmax_{v ∈ V^t} P(v | c, t)   (1)
Any pretrained ITM model (e.g., CLIP) can calculate the posterior. In Figure 2, the set of candidate images V^t for the target word t is entered into the image encoder. At the same time, the context c, which includes t, is entered into the text encoder. Then, the inner products between the output hidden representations of the images h_{v_1}, ..., h_{v_|V^t|} and that of the context h_c are fed into a softmax function, which computes a probability distribution over the images. Finally, the image with the highest probability is selected as the model's prediction for the target t given the context c.
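The selection rule just described can be sketched in a few lines of plain Python; the embeddings below are toy values standing in for the encoders' outputs, not real CLIP features:

```python
import math

def softmax(scores):
    """Turn raw matching scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def image_posterior(h_images, h_context):
    """P(v | c, t): softmax over image-context inner products."""
    logits = [sum(hi * hc for hi, hc in zip(h_img, h_context))
              for h_img in h_images]
    return softmax(logits)

# Toy 2-d embeddings for 3 candidate images (illustrative values only).
h_images = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h_context = [0.2, 0.9]
probs = image_posterior(h_images, h_context)
prediction = probs.index(max(probs))  # index of the selected candidate
```

The second candidate wins here because its embedding has the largest inner product with the context embedding, mirroring how the ITM model's prediction is read off the softmax.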

Unsupervised VWSD Incorporating Gloss Information
Zero-shot ITM models are usually pretrained without much consideration of polysemous words. For example, Figure 2 demonstrates that CLIP fails to predict the correct answer for the target word 'Angora', although it is provided with the clear hint 'city' in the given context. Therefore, the zero-shot performance of pretrained ITM models may be limited on the VWSD task. One solution is to use the gloss information of a lexical knowledge-base (LKB), particularly the sense definitions, because the definitions in LKBs elaborate each sense for readers who do not know its meaning. Thus, we assume that the sense definitions in LKBs can help ITM models conduct VWSD by injecting the meaning of the correct sense into the models' input. However, since the correct sense of the target word is not given, the definitions cannot be applied directly. For this reason, we suggest a novel gloss-incorporated VWSD approach inspired by Bayesian inference, as presented in Eq. 2. Suppose D^t is the set of definitions for the target word t extracted from an LKB. By the chain rule, the posterior can be decomposed into two conditional probabilities associated with the latent variable D^t:

P(v | c, t) = Σ_{i=1}^{|D^t|} P(v | D^t_i, c, t) P(D^t_i | c, t)   (2)
In this case, the right term P(D^t_i | c, t) (Context to Definition; C2D) predicts the conditional probability of the ith sense definition D^t_i given the target word t and context c, similar to gloss-enhanced WSD models (Huang et al., 2019; Blevins and Zettlemoyer, 2020). Meanwhile, the left term P(v | D^t_i, c, t) (Definition to Image; D2I) is the conditional probability of v given the ith sense definition, the context, and the target word. This enriches the context with its relevant sense definitions, enabling more sophisticated ITM. Finally, we calculate P(v | c, t) by marginalizing over all available sense definitions D^t_1, ..., D^t_|D^t|. Figure 3 presents an illustrative concept of our gloss-incorporated VWSD approach with a pretrained CLIP. First, as in the original CLIP, the set of candidate images V^t and the context c are input to the image encoder and the text encoder, respectively. Meanwhile, the set of definitions D^t of the target word is extracted from an LKB. In our work, we utilize WordNet (Miller, 1995), which has been widely used in previous semantic analysis tasks (Pilehvar and Camacho-Collados, 2019; Bevilacqua et al., 2021), as our source LKB. Then D^t, c, and t are jointly input to the text encoder with the following template.
{context} : {ith sense's definition}

C2D is computed from the inner products between the hidden representations of the definitions d^t_1, ..., d^t_|D^t| and that of the context h_c. D2I is then calculated from the inner products between the hidden representations of the images and those of the definition-enriched contexts. Both the C2D and D2I scores are transformed into probability distributions by the softmax function. Then, we choose the image with the highest marginal probability as the prediction. As a result, for the example in Figure 3, our model predicts the correct answer for the given context 'Angora city', whereas the original CLIP wrongly selects an image of an Angora cat with the highest probability (as shown in Figure 2), even though the network topology and the pretrained parameters in our model are the same as in the original CLIP model.

Figure 4: Examples of GPT-3 generated definitions when the context, target word, and part-of-speech are 'angora city', 'angora', and 'noun' (n), respectively. (b) Our context-aware definition generation. Prompt: "Define "angora" in angora city. angora (n):" — Definition: "A city in Turkey that stands on the banks of the Angora River, near …"
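The two-step inference above can be sketched numerically. All scores below are toy values (not real encoder outputs) for a target with two senses ('city' and 'cat') and two candidate images:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gloss_posterior(sim_c2d, sim_d2i):
    """P(v|c,t) = sum_i P(v | D_i, c, t) * P(D_i | c, t).

    sim_c2d: one context-definition score per sense (C2D logits).
    sim_d2i: per sense, one score per candidate image (D2I logits).
    """
    p_c2d = softmax(sim_c2d)                   # P(D_i | c, t)
    p_d2i = [softmax(row) for row in sim_d2i]  # P(v | D_i, c, t)
    n_images = len(sim_d2i[0])
    return [sum(p_c2d[i] * p_d2i[i][v] for i in range(len(p_c2d)))
            for v in range(n_images)]

sim_c2d = [2.0, 0.1]         # the 'city' definition fits the context best
sim_d2i = [[0.2, 3.0],       # with the 'city' definition, image 2 matches
           [2.5, 0.3]]       # with the 'cat' definition, image 1 matches
posterior = gloss_posterior(sim_c2d, sim_d2i)
```

Because the C2D term puts most of its mass on the context-appropriate sense, the marginal posterior favors the image matching that sense, even though the other image scores highly under the wrong definition.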

Handling OOV with the Context-Aware Definition Generation
Not all words have definitions available in a lexical knowledge-base. In particular, proper nouns, compound words, and foreign words frequently induce OOV issues. For example, in the SE23 dataset, about 14.33% of the target words have no definition in the English WordNet. Therefore, we propose to tackle the OOV issue with a definition generation approach. A previous study showed that GPT-3 can generate the definition of a novel word (Malkin et al., 2021). However, since this approach does not consider the context of the word, it may not generate a definition for the correct sense. Thus, we suggest generating a definition with a prompt that contains both the context and the target word. Here, we add a conditional sentence containing the context of the target word. For example, when the target word is 'angora' and the context is 'angora city', we place the conditional sentence "Define "angora" in angora city." in front of the previous input "angora (n)". Indeed, in the example, the definition generated with our method describes the correct sense better than the previous method.
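The prompt construction can be made concrete as follows; the prompt strings come from the paper's 'angora' example, while the function names are our own sketch:

```python
def dg_prompt(target, pos="n"):
    """Context-free prompt in the style of Malkin et al. (2021)."""
    return f"{target} ({pos}):"

def cadg_prompt(target, context, pos="n"):
    """Our context-aware prompt: a conditional sentence placed in
    front of the context-free DG input."""
    return f'Define "{target}" in {context}. ' + dg_prompt(target, pos)

prompt = cadg_prompt("angora", "angora city")
# prompt == 'Define "angora" in angora city. angora (n):'
```

The only difference from the previous approach is the leading conditional sentence, which steers the language model toward the sense that fits the given context.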
Experiments

Experimental Dataset

SE23 We used the dataset of the SemEval-2023 Task 1 VWSD challenge. It consists of 12,869 examples and 13,000 candidate images. Each example has 10 candidates, including 1 answer image and 9 distractors. Each context contains 2.5 words on average. The dataset contains 14.33% OOV target words (1,845 out of 12,869).

Experimental Setting
VWSD For the experiments, we adopted two SOTA zero-shot ITM models, CLIP and FLAVA, as pretrained parameters are publicly available for both. Note that CLIP uses a text encoder and an image encoder, while FLAVA contains a text encoder, an image encoder, and a multimodal encoder. To calculate an image-text matching score, FLAVA uses the multimodal encoder, which cross-encodes image and text features simultaneously. When calculating C2D, we use FLAVA's text encoder, as in Figure 3.
We used WordNet 3.0 as the main LKB. We also compare two kinds of GPT-3 generated definitions. The first is Malkin et al. (2021)'s definition generation (DG). The other is CADG (as described in Section 5). WN+CADG applies CADG's definitions in OOV cases and uses WordNet definitions otherwise.
Definition Generation We re-implemented Malkin et al. (2021)'s definition generation experimental setting.
Specifically, we sampled a definition for each example using GPT-3's Davinci variant, the largest available model, and generated samples with a temperature of 1.0.
Evaluation Criteria Following Raganato et al. (2023)'s setting, we evaluated the VWSD models' performance with hits at 1 (Hits@1) and the mean reciprocal rank (MRR). Moreover, we used Student's t-test (Student, 1908) to verify the significance of the performance differences.
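Both metrics are simple functions of the rank assigned to the gold image among the 10 candidates; the ranks below are toy values for illustration:

```python
def hits_at_1(gold_ranks):
    """Fraction of examples whose gold image is ranked first."""
    return sum(1 for r in gold_ranks if r == 1) / len(gold_ranks)

def mean_reciprocal_rank(gold_ranks):
    """Average of 1/rank of the gold image over all examples."""
    return sum(1.0 / r for r in gold_ranks) / len(gold_ranks)

# Toy ranks of the gold image, one per example.
ranks = [1, 2, 1, 4]
h1 = hits_at_1(ranks)              # 2 of 4 examples ranked first: 0.5
mrr = mean_reciprocal_rank(ranks)  # (1 + 0.5 + 1 + 0.25) / 4 = 0.6875
```

Hits@1 only rewards exact top-1 predictions, while MRR also gives partial credit when the gold image is ranked near the top, which is why the two metrics can diverge.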

Experimental Results
The experimental results in Figure 5 show that incorporating WordNet definitions enhanced performance on both ambiguous and trivial words for both CLIP and FLAVA. In particular, the performance gain was remarkable on trivial words (from 71.34 to 85.91 for CLIP and from 69.83 to 81.99 for FLAVA). Moreover, even for ambiguous words, performance improved significantly (p < 1e-3) without any additional training or assistance from external systems such as WSD models. CADG substantially increased performance on both OOV and trivial words. In particular, compared to DG, the performance differences are remarkable on OOV words.
Meanwhile, while FLAVA shows prominent improvement from WordNet integration, the impact of generated definitions tends to be lower than for CLIP. Considering that WordNet definitions were manually constructed by experts, we speculate that the model is sensitive to the quality of the input definitions.

Analysis on Ambiguous Target Words
We analyzed the performance change according to the ambiguity level of the target word (Kwon et al., 2022). However, compared to less ambiguous cases, the performance improvement rate is lower. These results imply that further enhancement is required for highly ambiguous words. Although WordNet integration improves performance for ambiguous target words, we still want to know how competitive the improvement is. For this reason, we compared the performance of our WordNet-incorporated model with that of a pipeline system using a WSD model. Specifically, T5_SemCor, a fine-tuned WSD model, predicts the WordNet sense for a given target word and context, and the probability distribution over the candidate images is calculated based on the predicted sense. Our model showed comparable results to the pipeline system on Hits@1 and achieved higher performance on MRR. This is due to the error cascading issue of pipeline systems (Finkel et al., 2006; Kwon et al., 2019). That is, in the pipeline system, errors of the WSD model directly decrease performance. In contrast, our approach is rather free from error cascading, since the C2D and D2I probabilities complement each other.

Evaluation on the Generated Definitions
To evaluate the quality of the generated definitions, we randomly sampled generated definitions and had them judged by human annotators. Table 4 presents the average human agreement scores on DG and CADG. The results show that our CADG achieved higher scores than DG. In particular, in Figure 4 and Table 5, we can see that the definitions of ambiguous words generated with CADG are semantically more similar to the WordNet answer senses than those of DG, in line with the purpose for which CADG was designed.

Impact of the Generated Definitions' Quality
We also verified whether the quality of the generated definitions affects VWSD performance. Table 6 presents the experimental results on VWSD examples whose generated definitions were judged correct (Correct) or incorrect (Incorrect) by both annotators. Table 6 demonstrates that the quality of the generated definitions indeed affects the performance of the downstream VWSD task.

Experiments on Multiple Generated Definitions
Since we sampled one definition for each input example in the main experiments, it remains questionable whether the number of sampled definitions affects the model's performance. Table 7 reports the performance of DG and CADG according to the number of generated definitions (n) per input. The results show that the number of sampled definitions does not significantly affect performance. Specifically, with 2 generated definitions per input, the performance of DG and CADG increased by only 0.09%p and 0.03%p, respectively. Furthermore, with 3 generated definitions, the performance even decreases slightly for both DG and CADG. As a result, sampling multiple definitions per input does not meaningfully affect performance, or even slightly degrades it.

Error Analysis

VWSD
Our model still suffers from error cascading through the C2D probability, though this is mitigated by the Bayesian-style inference. The most typical error case is error cascading in the C2D probability calculation. In particular, due to the nature of neural networks (Guo et al., 2017), overconfidence in wrong classes frequently causes errors. For example, in Table 8, among the 10 senses of the target word 'paddle' extracted from WordNet, the conditional probability of the correct sense was calculated as 0.00%, resulting in an error in the final posterior calculation. Another error case is when the correct sense is absent from WordNet. In the example, the target word 'Thompson' indicates a firearm, but WordNet contains only person entries. This is a separate issue from OOV, where there is no entry for the target word at all, and we observed that it mainly occurs with proper nouns.

Definition Generation
We found two representative error cases in the definition generation results: 1) misdisambiguation and 2) hallucination. Misdisambiguation occurs when GPT-3 generates a definition for the wrong sense of a polysemous word. In Figure 6a, considering the context 'lime oxide', we would expect a definition of limestone to be generated. However, both approaches generate a definition of the lime fruit. On the other hand, as pointed out in previous research (Ishii et al., 2022), we also observed that GPT-3 hallucinates. Figure 6b is an example of the hallucination issue: 'albatrellus' is a genus of fungi in the context 'albatrellus genus', yet the definitions generated by both approaches pertain to the albatross, a species of bird. Detailed examples of error cases can be found in Appendix A.

Conclusion and Future Work
This paper introduces a novel VWSD methodology that effectively incorporates gloss information from an external resource. Our work has two main innovations: 1) Bayesian-style inference for SOTA ITM models, and 2) context-aware definition generation with GPT-3 to overcome the OOV issue. Experimental results show that our Bayesian-style inference with WordNet integration significantly improves VWSD performance without additional training. For ambiguous target words, the performance of our approach is comparable to pipeline systems using fine-tuned WSD models. Moreover, context-aware definition generation helps mitigate OOV issues in the downstream VWSD task and shows higher performance than the previous definition generation approach.
In the future, we plan to tackle the error cascading caused by overconfidence in the C2D probability. For this, we may explore prompting, which is known to perform well in zero-shot prediction (Liu et al., 2023). In addition, to deal with the hallucination and misdisambiguation problems of GPT-3 generated definitions, we may employ controllable generation by resampling (Ji et al., 2022).

Limitations
Our work has the following limitations. First, we used only one evaluation dataset, SE23, because it is the only dataset suitable for the VWSD setting, especially for OOV examples. In addition, our methodology relies entirely on WordNet, which may limit the model when the target word is a proper noun such as a named entity. Finally, we depend on GPT-3 definition generation to handle OOV words. Since the generated definitions may contain errors, as revealed in the qualitative analyses, these errors can lead to incorrect predictions.

Ethical Consideration
The generated definitions were annotated by two annotators. Both annotators were fully paid in compliance with local minimum wage regulations. In addition, in the sampled definition generations, the authors could not find any statements violating the ACL anti-harassment policy. However, generated definitions that the authors have not vetted may still contain toxic or hateful content (e.g., racist, insulting, or xenophobic statements).

Appendix A

Table 9 presents all the incorrectly generated definitions described in Section 8. We found the following three error types: 1) misdisambiguation, 2) hallucination, and 3) others. First, the misdisambiguation cases are caused by bias from pretraining, and CADG shows less misdisambiguation than DG. In particular, GPT-3 generated more than one definition of the target words 'conch', 'reaper', and 'ruin' in DG, while we found no such cases in our approach. On the other hand, hallucination cases occur when the generated definitions describe completely different terms with similar spellings ('stonechat' for CADG, 'driftfish' for DG), or when the detailed descriptions are incorrect although somewhat similar ('osteostraci' for CADG, 'nestor' for DG). In particular, in Table 10, for 'wulfenite' and 'cordierite', we can see that definitions are generated from parts of each lexeme ("wulfen" and "cord"). Finally, in the other cases, the generated definitions may not be in definition form ('lynching' for CADG, 'areca' for DG), or the target word itself is output as its definition ('wulfenite' for CADG).