ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition

Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. Most work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions, as image and text representations are trained separately on the data of their respective modalities and are not aligned in the same space. Since text representations play the most important role in MNER, in this paper we propose Image-text Alignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image into regional object tags, image-level captions and optical characters as visual contexts, concatenates them with the input texts as a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of a pretrained textual embedding model to model the interaction between the two modalities, since they are both represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal and textual input views, so that the MNER model can be more practical in dealing with text-only inputs and robust to noise from images. In our experiments, we show that ITA models can achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.


Introduction
Named Entity Recognition (NER) (Sundheim, 1995) has attracted increasing attention in the natural language processing community. It has been applied to many domains such as news (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), e-commerce (Fetahu et al., 2021), social media (Strauss et al., 2016; Derczynski et al., 2017) and bio-medicine (Dogan et al., 2014; Li et al., 2016). Several recent studies focus on improving the accuracy of NER models by utilizing image information (MNER) in tweets (Moon et al., 2018; Lu et al., 2018). Most approaches to MNER use the attention mechanism to model the interaction between image and text representations (Zhang et al., 2021a), in which image representations come from a pretrained feature extractor, e.g. ResNet (He et al., 2016), and text representations are extracted from pretrained textual embeddings, e.g. BERT (Devlin et al., 2019). Since these models are separately trained on datasets of different modalities and their feature representations are not aligned, it is difficult for the attention mechanism to model the interaction between the two modalities.
Recently, pretrained vision-language (V+L) models such as LXMERT (Tan and Bansal, 2019), UNITER and Oscar (Li et al., 2020b) have achieved significant improvements on several cross-modal tasks such as image captioning, VQA (Agrawal et al., 2015), NLVR (Suhr et al., 2019) and image-text retrieval (Young et al., 2014). Most pretrained V+L models are trained on image-text pairs and simply concatenate text features and image features as the input of pretraining. There are, however, two problems. First, texts in these datasets mainly contain common nouns instead of named entities, which leads to an inductive bias over common nouns and images. Second, despite its important role in pretraining V+L models, the image modality only plays an auxiliary role in MNER for disambiguation, and can sometimes even be discarded. These problems make pretrained V+L models perform worse than pretrained language models on MNER.
Pretrained textual embeddings such as BERT, XLM-RoBERTa (Conneau et al., 2020) and LUKE (Yamada et al., 2020) have achieved state-of-the-art performance on various NER datasets through simple fine-tuning. Since most transformer-based pretrained textual embeddings are trained over long texts, recent work (Akbik et al., 2019; Schweter and Akbik, 2020; Yamada et al., 2020) has shown that introducing document-level contexts can significantly improve the accuracy of an NER model. The attention mechanism in transformer-based pretrained textual embeddings can utilize contexts to improve the token representations of a sequence. Moreover, pretrained V+L models such as Oscar and VinVL (Zhang et al., 2021b) can use object tags detected in images to significantly ease the alignment between text and image features. Therefore, the images in MNER can be converted to texts as well, so that the image representations can be aligned to the space of text representations. As a result, the attention module of the pretrained textual embeddings can easily model the interactions between the aligned image and text representations, without introducing a new attention module. In this paper, we propose ITA, a simple but effective framework for Image-Text Alignments. ITA converts an image into visual contexts in the textual space through multi-level alignments. We concatenate the NER texts with the visual contexts as a new cross-modal input view and feed it into a pretrained textual embedding model to improve the token representations of the NER texts, which are then fed into a linear-chain CRF (Lafferty et al., 2001) layer for prediction. In practice, an MNER model should be robust when there is only text information, as images may be unavailable or can introduce noise. Sometimes it is even undesirable to use images, as image feature extraction can be inefficient in online serving.
Therefore, we further propose to utilize the cross-modal input view to improve the accuracy of the textual input view, based on a cross-view alignment that minimizes the KL divergence between the probability distributions of the two views.
Our contributions can be summarized as follows. ITA locally extracts object tags and their corresponding attributes of image regions from an object detector, and uses them, together with image-level captions and optical characters, as multi-level visual contexts. We show in experiments that ITA can significantly improve model accuracy on MNER datasets and achieve the state of the art. The cross-view alignment module can significantly improve both the cross-modal and textual input views, and bridge the performance gap between the two views.

Approaches
We consider the NER task as a sequence labeling problem. Given a sentence w = {w_1, ..., w_n} with n tokens and its corresponding image I, a sequence labeling model aims to predict a label sequence y = {y_1, ..., y_n}. In our framework, we focus on incorporating visual information to improve the representations of the input tokens by aligning visual and textual information effectively. We use a visual context generator to convert the image I into texts forming visual contexts w̃ = {w̃_1, ..., w̃_m} with m tokens. We then concatenate the input text and the visual contexts as a cross-modal text+image (I+T) input view instead of the text-only (T) input view. We feed the I+T input into a pretrained textual embedding model to get stronger token representations of the input sentence. The token representations are then fed into a linear-chain CRF layer to get the label sequence y. To further improve the model accuracy of both input views, we use the cross-view alignment module to align the output distributions of the I+T and T input views during training. The architecture of our framework is shown in Figure 1.

Figure 1: The architecture of ITA. ITA aligns an image into object tags, image captions and texts from OCR. ITA takes them as visual contexts and feeds them together with the input texts into the transformer-based embeddings. In the cross-view alignment module, ITA minimizes the distance between the output distributions of the cross-modal and textual inputs.
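Concretely, building the cross-modal input view amounts to a plain token concatenation. The sketch below is illustrative, not the authors' code; the function name, the separator spelling and the example tokens are assumptions:

```python
def build_cross_modal_view(text_tokens, visual_contexts, sep="[X]"):
    """Concatenate the input text with visual contexts (the I+T view).

    visual_contexts: list of token lists, e.g. the object tags,
    captions and OCR text produced by the alignment modules.
    """
    view = list(text_tokens)
    for ctx in visual_contexts:
        view.append(sep)   # separator between the text and each context
        view.extend(ctx)
    return view

tokens = ["Harry", "Potter", "at", "the", "premiere"]
contexts = [["young", "man"], ["a", "man", "on", "a", "red", "carpet"]]
print(build_cross_modal_view(tokens, contexts))
```

The separator token here is a placeholder; with BERT embeddings it would be rendered as "[SEP]".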

NER Model Architecture
We use a neural model with a linear-chain CRF layer, a widely used approach for the sequence labeling problem (Huang et al., 2015; Akbik et al., 2018; Devlin et al., 2019). The input is fed into a transformer-based pretrained textual embedding model and the output token representations {r_1, ..., r_n} are fed into the CRF layer:

p_θ(y|w) = exp(s(r, y)) / Σ_{y' ∈ Y(w)} exp(s(r, y'))

where θ is the model parameters, s(r, y) is the sum of the emission and transition scores of the label sequence y, and Y(w) is the set of all possible label sequences given the input w. Given the gold label sequence ŷ in the training data, the objective function of the model for the T input view is:

L_T = -log p_θ(ŷ|w)    (1)

The loss can be calculated using the forward algorithm.
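The forward computation above can be sketched in pure Python. This is a generic linear-chain CRF negative log-likelihood with the forward pass in log space, under an assumed emission/transition score decomposition, not the authors' implementation:

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def crf_nll(emissions, transitions, gold):
    """Negative log-likelihood -log p(gold | w) of a linear-chain CRF.

    emissions:   n x L matrix of per-token label scores
    transitions: L x L matrix, transitions[i][j] = score of label i -> j
    gold:        gold label sequence of length n
    """
    n, L = len(emissions), len(emissions[0])
    # Score of the gold path: emission plus transition scores.
    gold_score = emissions[0][gold[0]]
    for t in range(1, n):
        gold_score += transitions[gold[t - 1]][gold[t]] + emissions[t][gold[t]]
    # Forward pass: log partition function over all label sequences.
    alpha = list(emissions[0])
    for t in range(1, n):
        alpha = [log_sum_exp([alpha[i] + transitions[i][j] for i in range(L)])
                 + emissions[t][j] for j in range(L)]
    log_Z = log_sum_exp(alpha)
    return log_Z - gold_score
```

With all-zero scores the distribution over label sequences is uniform, so the loss reduces to n·log(L).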

Image-text Alignments
The transformer-based pretrained textual embeddings have strong representations over texts. Therefore, ITA converts the image information into the textual space by generating texts from the image, so that the learning of the self-attention in the transformer-based model can be significantly eased compared with simply using image features from an object detector. We propose local alignment (LA), global alignment (GA) and optical character alignment (OCA) approaches.
Object Tags as Local Alignment Given an image, the image information can be decomposed into a set of objects in local regions. The object tags of each region textually describe the local information in the image. To extract the objects, we use an object detector OD to identify and locate the objects in the image:

{(o_1, a_1), ..., (o_l, a_l)} = OD(I)

The attribute predictions from the object detector contain multiple attribute tags a_i for each object o_i. We sort the objects in descending order of the confidence scores of the detection model. For each object, we heuristically keep 0 to 3 attributes with confidence scores above a threshold m. We linearize the attributes and put them before the corresponding objects, since the attributes are adjectives describing the object tags. As a result, we take the predicted l object tags o and their attribute tags a from the object detector as the locally aligned visual contexts w_LA:

w_LA = [a_1; o_1; ...; a_l; o_l]
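The sorting-and-linearization procedure can be sketched as follows; the detector output format, field names and example values are assumptions for illustration:

```python
def linearize_objects(detections, attr_threshold=0.1, max_attrs=3):
    """Turn detector output into the local visual context w_LA.

    detections: list of dicts like
      {"tag": "man", "conf": 0.9, "attrs": [("young", 0.5), ("tall", 0.05)]}
    Objects are sorted by detection confidence; for each object we keep
    up to `max_attrs` attributes above `attr_threshold`, placed before
    the object tag (attributes act as adjectives).
    """
    tokens = []
    for det in sorted(detections, key=lambda d: d["conf"], reverse=True):
        kept = [a for a, c in det["attrs"] if c >= attr_threshold][:max_attrs]
        tokens.extend(kept + [det["tag"]])
    return tokens

dets = [
    {"tag": "dog", "conf": 0.6, "attrs": [("brown", 0.4), ("wet", 0.05)]},
    {"tag": "man", "conf": 0.9, "attrs": [("young", 0.5)]},
]
print(linearize_objects(dets))  # ['young', 'man', 'brown', 'dog']
```

Note how the low-confidence attribute "wet" is dropped by the threshold, matching the heuristic described in the text.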

Image Captions as Global Alignment
Though the local alignment can localize the image into objects, the objects cannot fully describe the meaning of the whole image. Image captioning is a task that predicts the meaning of an image. Therefore, we align the image into k image captions with an image captioning model IC:

{w^1, w^2, ..., w^k} = IC(I)

where {w^1, w^2, ..., w^k} are captions generated from beam search with k beams. We concatenate the k captions together with a special separator token [X] to form the aligned global visual contexts w_GA:

w_GA = [w^1; [X]; w^2; [X]; ...; w^k]

The exact label (e.g. "[SEP]" in BERT) of the special [X] token depends on the choice of embeddings.
Optical Character Alignment Some images contain text that was added when they were created to enrich the semantic information the images are meant to convey. In order to better understand this type of image, we use an OCR model to identify and extract the texts in the image:

w_OCA = OCR(I)

where w_OCA are the texts extracted by the OCR model. Note that w_OCA may be empty if there is no text in the image.

We concatenate the input sentence and our aligned visual contexts to form the I+T input view ŵ = [w; w̃], where w̃ can be one of w_LA, w_GA, w_OCA or the concatenation of all of them (denoted as All). The transformer-based embeddings are fed with the I+T input view and output image-text fused token representations {r̃_1, ..., r̃_n} for each token. The token representations are fed into the CRF layer to get the probability distribution p_θ(y|ŵ). Similar to Eq. 1, the objective function of the model for the I+T input view is:

L_{I+T} = -log p_θ(ŷ|ŵ)    (2)

Cross-View Alignment There are several limitations in incorporating images into NER prediction: 1) the images may not be available in testing; 2) aligning images to texts requires several pre-processing pipelines instead of an end-to-end model, which is so time-consuming that it is not applicable to time-critical scenarios such as online serving; 3) the noise in the images can mislead the MNER model into wrong predictions. To alleviate these issues, we propose Cross-View Alignment (CVA), which targets reducing the gap between the I+T and T input views over the output distributions, so that the MNER model can better utilize the textual information in the input. During training, CVA minimizes the KL divergence between the probability distributions of the I+T and T input views:

L_CVA = KL(p_θ(y|ŵ) || p_θ(y|w))    (3)

Since the I+T input view has additional visual information in the input and we want the T input view to match the accuracy of the I+T input view, we only back-propagate through p_θ(y|w) in Eq. 3. Therefore, Eq. 3 is equivalent to calculating the cross-entropy loss between the two distributions:

L_CVA = -Σ_{y ∈ Y(w)} p_θ(y|ŵ) log p_θ(y|w)    (4)

As the set of all possible label sequences Y(w) is exponential in size, we calculate the posterior distributions at each position, p_θ(y_i|w) and p_θ(y_i|ŵ), through the forward-backward algorithm to approximate Eq. 4:

L_CVA ≈ -Σ_{i=1}^{n} Σ_{y_i} p_θ(y_i|ŵ) log p_θ(y_i|w)    (5)

where the posteriors are computed from the token representations r*_i, with r*_i representing either r_i or r̃_i.
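Since only the T view receives gradients, the token-level approximation in Eq. 5 reduces to a cross-entropy in which the I+T posteriors act as fixed targets. A minimal sketch with plain lists, assuming the per-token marginals have already been obtained from the forward-backward algorithm:

```python
import math

def cva_loss(p_it, p_t):
    """Token-level cross-view alignment loss (Eq. 5).

    p_it: per-token label posteriors from the I+T view (teacher, detached)
    p_t:  per-token label posteriors from the T view (student)
    Both are n x L lists of probabilities (student entries must be > 0).
    """
    loss = 0.0
    for q, p in zip(p_it, p_t):
        # Cross-entropy of the student under the fixed teacher targets.
        loss -= sum(qi * math.log(pi) for qi, pi in zip(q, p))
    return loss
```

When the two views already agree, the loss equals the entropy of the shared distribution; in an autograd framework the teacher term would be detached so gradients flow only through the T view, as described above.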
Training During training, we jointly train the T and I+T input views with the training objectives in Eq. 1 and 2, together with the CVA alignment objective in Eq. 5. As a result, the final training objective for ITA is:

L = L_T + L_{I+T} + L_CVA    (6)

Experiments
We conduct experiments on three MNER datasets.
To show the effectiveness of our approaches, we use two embedding settings and compare our approaches with previous multi-modal approaches.

Settings
Datasets We show the effectiveness of our approaches on the Twitter-15, Twitter-17 and SNAP Twitter datasets, containing 4,000/1,000/3,357, 3,373/723/723 and 4,290/1,432/1,459 sentences in the train/development/test splits respectively. The Twitter-15 dataset is constructed by Zhang et al. (2018). The SNAP dataset is constructed by Lu et al. (2018), and the Twitter-17 dataset is a filtered version of SNAP constructed by Yu et al. (2020).
Model Configuration For token representations, we use the BERT base model to compare fairly with most recent work (e.g. Zhang et al., 2021a). Recently, XLM-RoBERTa has achieved state-of-the-art accuracy on various NER datasets by feeding the input together with contexts to the model. To further utilize the visual contexts in transformer-based embeddings, we use the XLM-RoBERTa large (XLMR) model as another embedding in our experiments. To extract object tags and image captions, we use VinVL (Zhang et al., 2021b), a pretrained V+L model built on a newly pretrained large-scale object detector with the ResNeXt-152 C4 architecture. We use the object detection module of VinVL to predict object tags and their corresponding attributes. The number of object tags and attributes varies over the images and is no more than 100. We set the threshold m to 0.1 for keeping the attributes of each object. For image captions, we use the VinVL large model finetuned on MS-COCO (Lin et al., 2014) captions with CIDEr optimization (Rennie et al., 2017). In our experiments, we use a beam size of 5 with at most 20 tokens for prediction and keep all 5 captions as the visual contexts. For OCR, we use Tesseract OCR (Smith, 2007), an open-source OCR engine, with its default configuration to extract the texts in the image.
Training Configuration During training, we finetune the pretrained textual embedding model with AdamW (Loshchilov and Hutter, 2018).

Results
In Table 1, we compare our approaches against our baselines under different training and evaluation modalities (T for the text-only input view and I+T for the multi-modal input view). Results show that ITA models are significantly stronger than our BERT-CRF and XLMR-CRF baselines (Student's t-test with p < 0.05). Among the aligned visual contexts, LA, GA and OCA are competitive in most cases. To show the effectiveness of CVA, we report the results of both input views in evaluation. With CVA, the accuracy of both input views improves, especially for the T input view; CVA makes the T input view competitive with the I+T input view. Moreover, the combination of all the alignments, ITA-All+CVA, further improves model accuracy in most cases. The accuracy of the MNER models can be significantly improved by using XLMR embeddings, which shows the importance of the text modality in MNER. With XLMR embeddings, model accuracy can be further improved with ITA. The relative improvements over the baseline models are sometimes higher with XLMR than with BERT, which shows that the visual contexts can be further utilized with stronger embeddings.
In Table 2, we compare ITA with previous state-of-the-art approaches: OCSGA, UMT, RIVA, RpBERT and UMGF. For fair comparison, we report the results of these models based on BERT base embeddings. Moreover, since most of these previous approaches report the best model accuracy instead of the averaged model accuracy, we use the best model accuracy of ITA-All+CVA over 5 runs. We also report our reproduced results of UMT, RpBERT and UMGF on the corresponding datasets. The results show that ITA-All+CVA outperforms all of the previous approaches. On the SNAP dataset, the reported accuracy of RpBERT base is competitive with ITA-All+CVA. However, we find that the accuracy of our reproduced RpBERT base is significantly lower than the reported accuracy, even after careful checking of the source code and hyper-parameter tuning. Moreover, the fact that our BERT-CRF baseline achieves competitive accuracy with previous state-of-the-art multi-modal approaches shows that most of the previous work has not fully explored the strength of the text representations for the task.

We reproduced the RpBERT base results based on the official code: https://github.com/Multimodal-NER/RpBERT

Table 3: Our reproductions of previous baselines and approaches. "Improved" means our improved models based on the UMT code base.
Discussion about Textual Modules As shown in Tables 1 and 2, the textual baselines (i.e. BERT-CRF) of previous work are significantly weaker than ours. In most previous MNER architectures, the textual modules are mainly based on the baseline architectures with some modifications. We further show that the baselines of previous work are not well-trained and examine how the multi-modal approaches perform with stronger textual modules. In Table 3, we rerun the BERT-CRF baseline based on the released code of UMT. Based on that code, we tried to improve the baseline models by using the same loss function as ours. The accuracy of the BERT-CRF models in the code is significantly improved, but the UMT models based on the improved code are not improved and even get worse on Twitter-17. Therefore, we suspect the UMT model cannot be further improved even with stronger textual modules. Zhang et al. (2021a) also reported a baseline based on the implementation of UMT, so we suspect the UMGF model cannot be improved either. In sum, the under-trained textual baselines of previous work make the effectiveness of the images unclear, and we show that some of the MNER models perform even worse than our BERT-CRF model.

Comparison with Other Variants
To further show the effectiveness of ITA, we compare ITA with the following variants of the MNER model in Table 4. The accuracy of the first variant drops slightly compared with our BERT-CRF baseline, which shows that the improvement of our approach comes from the visual contexts rather than from extending the input sequence length of the embeddings.

ITA-Joint:
This is an ablated model of ITA-All+CVA, in which we train the ITA-All model on both input views without the CVA loss in Eq. 5. The model accuracy improves only moderately for the T input view, while ITA-All+CVA improves both input views significantly, which shows the effectiveness of the CVA module of ITA.
ITA-LA_BU and ITA-GA_BU: We conduct experiments to see how the accuracy changes when using weaker image features. We use Bottom-Up features proposed by Anderson et al. (2018) for object detection and image captioning. The captioning model is a pretrained image captioning model proposed by Luo et al. (2018).

Figure 2: The relation between the number of captions input to the MNER model and model accuracy. The x-axis is the number of captions; the y-axis is the averaged F1 score on the test set.
The model accuracy can be improved by using better OCR models.
BERT-CRF+ImgFeat: Instead of ITA, we can directly feed the image region features generated from an object detector into BERT. We use a ResNet-152 model to generate region features and then feed them into a linear layer that projects them into the same space as the text features in BERT. Moreover, we compare the model with RpBERT w/o Rp, which is an ablated model of RpBERT and is equivalent to BERT-CRF+ImgFeat in its usage of BERT embeddings. Sun et al. (2021) showed that RpBERT w/o Rp can improve the model accuracy compared with their baseline. However, our results show that the model accuracy slightly drops compared with our BERT-CRF, which shows that it is difficult for the attention module of BERT to learn the relations between the unaligned representations of the two modalities.
VinVL-CRF: To show how pretrained V+L models perform on the NER task, we use VinVL, a very recent state-of-the-art pretrained V+L model on many multi-modal tasks. We feed the VinVL model with the texts and images in the MNER datasets and finetune the model on the task. We take the text representations output by VinVL as the input of the CRF layer. The accuracy of the finetuned VinVL model drops significantly compared to the BERT model, which shows that the inductive bias of the pretrained V+L model hurts model accuracy on MNER.
BERT+VinVL-CRF: As the VinVL model may carry an inductive bias over common nouns and images, we jointly finetune the BERT and VinVL models and concatenate the output text representations of the two models. The accuracy improves moderately, which shows that BERT is complementary to VinVL for MNER.

Figure 3: Averaged L2 distance between the token representations without image input (r_i) and with image input (r̃_i) for BERT-CRF+ImgFeat, ITA-All and ITA-All+CVA. The error bars show the standard deviation over 5 runs.

Analysis
Effect of the Number of Captions Using more captions output from the captioning model can improve the diversity of the visual contexts but can also add noise to them. To better understand how the number of captions affects model accuracy, we change the beam size and keep all the sentences output from the captioning model. The trends in Figure 2 show that model accuracy increases up to 5 captions for all the datasets and gradually drops when the number of captions further increases for the Twitter-15 and Twitter-17 datasets. This observation shows that using 5 captions keeps a good balance between the diversity and correctness of the captions.

How ITA Eases the Cross-Modal Alignments
Previous work such as Moon et al. (2018) visualized modality attention in several cases to show the effectiveness of their approaches. However, visualizing the multi-layer attention in transformer-based embeddings is relatively difficult. Instead of studying special cases, we statistically calculate the averaged L2 distance between the token representations r_i and r̃_i from the two input views to show how much the token representations depend on image information. In Figure 3, the L2 distance of ITA-All is significantly larger than that of BERT-CRF+ImgFeat. Besides, the standard deviation of BERT-CRF+ImgFeat is very large. These observations show that the image region features make the alignment difficult and unstable, while our visual contexts can significantly ease the cross-modal alignments. Moreover, with CVA, the L2 distance becomes much smaller and more stable, as CVA aligns the two input views to reduce the dependence on images, which shows that the MNER model can better utilize the textual information with CVA.
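The statistic plotted in Figure 3 is straightforward to compute; the sketch below averages the per-token L2 distance between the two views' representations (the vector values are illustrative only):

```python
import math

def avg_l2_distance(reps_text, reps_cross):
    """Average L2 distance between token representations from the
    T view (reps_text) and the I+T view (reps_cross).

    Both are lists of equal-length vectors, one per token of the
    original sentence (visual-context positions are excluded).
    """
    total = 0.0
    for r, r_bar in zip(reps_text, reps_cross):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(r, r_bar)))
    return total / len(reps_text)

print(avg_l2_distance([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]]))  # 2.5
```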

How Images Affect the NER Prediction
To study the effect of the images on each label, we show a comparison between our model and our baselines in Table 5. When the relative improvement of the F1 score is larger than 0.5, the relative improvement of precision is larger than that of recall. This observation shows that the improvement from MNER mainly comes from the images helping the model reduce false-positive predictions by disambiguating uncertain entities.

Related Work
Multi-modal Named Entity Recognition Most previous approaches to MNER focus on the interaction between image and text features through attention mechanisms. Moon et al. (2018) proposed a modality attention network to fuse the text and image features before the input to the BiLSTM layer. Lu et al. (2018) additionally used a visual attention gate for the output features of the BiLSTM layer. Zhang et al. (2018) proposed an adaptive co-attention network after the BiLSTM layer to model the interaction between image and text. Recently, Wu et al. (2020) proposed OCSGA, which uses object labels to model the interaction between texts and object labels in an additional dense co-attention layer. Compared with their work, we show a simpler and more effective way to utilize object labels and additionally use other alignment approaches to further improve model accuracy. Yu et al. (2020) proposed UMT, which utilizes a multi-modal interaction module and an auxiliary entity span detection module for MNER. Zhang et al. (2021a) proposed UMGF, which utilizes a pretrained parser to create the graph connections between visual object tags and textual words and uses a graph attention network to fuse the textual and visual features. In order to better model whether an image is related to the text, Sun et al. (2021) proposed RpBERT, which additionally trains on a text-image relation classification dataset proposed by Vempala and Preoţiuc-Pietro (2019) to prevent the negative effect of noisy images. Compared with RpBERT, we use CVA to let the NER model better utilize the input sentences without such supervision. All of these approaches focus on fusing the image and text features through the attention mechanism but ignore the gap between the image and text features, while we propose to fully utilize the attention mechanism in the pretrained textual embeddings by aligning image information into the textual space.

Conclusion
In this paper, we propose Image-Text Alignments for multi-modal named entity recognition, which convert images into object labels, captions and OCR texts to align the image representations into the textual space in a multi-level manner and form a cross-modal input view. The model can effectively utilize the attention module of the transformer-based embeddings. Considering noise, the availability of images and inference speed for practical use, we propose cross-view alignment, which lets the MNER model better utilize the text information in the input. In our experiments, we show that ITA significantly outperforms previous state-of-the-art approaches on MNER datasets. We also show that most previous work failed to train a good textual baseline, while our textual baseline can easily match or even outperform previous multi-modal approaches. In our analysis, we further examine how ITA eases the cross-modal alignments and how the images affect the NER prediction.

A.3 Case Study
Although images can generally help improve the accuracy of the NER model, there are many cases where the images contain misleading information that hurts the model prediction. We study two cases for LA and GA: 1) entities that are wrongly predicted by the BERT-CRF baseline but correctly predicted by ITA; 2) entities that are wrongly predicted by ITA without CVA but correctly predicted by the baseline and by ITA with CVA. Figure 4 shows the two cases with two samples each. Figure 4 (a) shows the first case, which illustrates the importance of the visual contexts. The baseline model fails to recognize the person entities "TWICE" and "Harry Potter", possibly because the two words usually appear as an adverb and a book name respectively. For the I+T input view, our MNER model is able to recognize hints such as "two girls", "young girl", "a couple of young men" and "woman" in the visual contexts and then correctly predict the two entities. Figure 4 (b) shows the second case, which illustrates how noise from the image misleads the model predictions. There are three and two person entities in the gold labels, but the visual contexts indicate that the top right image has "two baseball players" and the bottom right image has only "a woman". As a result, ITA without CVA predicts only two and one person entities in the two samples respectively, following the visual contexts. However, with CVA, ITA strikes a good balance in utilizing the textual and visual information and correctly predicts the entity labels in both the T and I+T input views. For OCA, we study how the extracted texts can help model prediction. In the upper sample of Figure 5, there are two "Donald" tokens in the image. The baseline model fails to identify the latter one, while ITA-OCA successfully identifies both. In the bottom sample of Figure 5, the texts in the image mainly mention "HARRY STYLES", which helps the model prediction.

A.4 Discussion
In our paper, we use captioning and object detection models trained on MS-COCO and Visual Genome. The model performance could be improved by using domain-specific models (e.g. for the Twitter domain). For OCA, the model accuracy may be poor if the OCR system does not support a certain language.