Multi-Modal Image Captioning for the Visually Impaired

One of the ways blind people understand their surroundings is by taking pictures and relying on descriptions generated by image-captioning systems. Current work on captioning images for the visually impaired does not use the textual data present in the image when generating captions. This problem is critical, as many visual scenes contain text and 21% of the questions asked by blind people about the images they take pertain to the text present in them. In this work, we propose altering AoANet, a state-of-the-art image-captioning system, to leverage text detected in the image as an input feature. In addition, we use a pointer-generator network to copy detected text to the caption when tokens need to be reproduced accurately. Our model outperforms AoANet on the benchmark dataset VizWiz, giving a 35% and 16.2% performance improvement on CIDEr and SPICE scores, respectively.


Introduction
Image captioning as a service has helped people with visual impairments learn about images they take and make sense of images they encounter in digital environments. Applications such as TapTapSee (2012) allow the visually impaired to take photos of their surroundings and upload them to receive descriptions. Such applications leverage a human-in-the-loop approach to generate descriptions. To bypass the dependency on a human, there is a need to automate the image captioning process. Unfortunately, the current state-of-the-art (SOTA) image captioning models are built using large, publicly available, crowdsourced datasets which have been collected and created in a contrived setting. Thus, these models perform poorly on images taken by blind people, largely because those images differ dramatically from the images present in the datasets. To encourage solving this problem, Gurari et al. (2020) released the VizWiz dataset, a dataset comprising images taken by the blind. Current work on captioning images for the blind does not use the text detected in the image when generating captions (Figures 1a and 1b show two images from the VizWiz dataset that contain text). The problem is critical, as many visual scenes contain text and up to 21% of the questions asked by blind people about the images they take pertain to the text present in them. This makes it all the more important to improve models so that they attend to the text in images as well as the objects.
With the availability of large labelled corpora, image captioning and reading scene text (OCR) have seen a steady increase in performance. However, traditional image captioning models focus only on the visual objects when generating captions and fail to recognize and reason about the text in the scene. This calls for incorporating OCR tokens into the caption generation process. The task is challenging since, unlike conventional vocabulary tokens, which depend on the text before them and can therefore be inferred, OCR tokens often cannot be predicted from the context and instead represent independent entities. Predicting a token from the vocabulary and selecting an OCR token from the scene are two rather different tasks which have to be seamlessly combined.
In this work, we build a model to caption images for the blind by leveraging the text detected in the images in addition to visual features. We alter AoANet, a SOTA image captioning model, to consume embeddings of tokens detected in the image using Optical Character Recognition (OCR). In many cases, OCR tokens such as entity names or dates need to be reproduced in the caption exactly as they appear in the image. To aid this copying process, we employ a pointer-generator mechanism. Our contributions are: 1) we build an image captioning model for the blind that specifically leverages text detected in the image; 2) we use a pointer-generator mechanism when generating captions to copy the detected text when needed.
Figure 1: Two images from the VizWiz dataset that contain text, with model-generated captions (e.g., (a) Model: "a bottle of water is on top of a ...").

Related Work
Automated image captioning has seen a significant amount of recent work. The task is typically handled using an encoder-decoder framework; image-related features are fed to the encoder and the decoder generates the caption (Aneja et al., 2018; Yao et al., 2018; Cornia et al., 2018). Language modeling based approaches have also been explored for image captioning (Kiros et al., 2014; Devlin et al., 2015). Apart from the architecture, image captioning approaches are also diverse in terms of the features used. Visual-based image captioning models exploit features generated from images. Multi-modal image captioning approaches exploit other modes of features in addition to image-based features, such as candidate captions and text detected in images (Wang et al., 2020). The task we address deals with captioning images specifically for the blind. This differs from traditional image captioning due to the authenticity of the dataset compared to popular datasets collected in contrived settings, such as MS-COCO (Chen et al., 2015) and Flickr30k (Plummer et al., 2015). The task is relatively under-explored. Previous works have approached the problem using human-in-the-loop approaches (Aira, 2017; BeSpecular, 2016; TapTapSee, 2012) as well as automated ones (Microsoft; Facebook). A particular challenge in this area has been the lack of an authentic dataset of photos taken by the blind. To address the issue, Gurari et al. (2020) created VizWiz-Captions, a dataset that consists of descriptions of images taken by people who are blind. In addition, they analyzed how SOTA image captioning algorithms perform on this dataset. Concurrent with our work, Dognin et al. (2020) created a multi-modal transformer that consumes ResNext-based visual features, object detection-based textual features and OCR-based textual features. Our work differs from this approach in the following ways: we use AoANet as our captioning model and do not account for rotation invariance during OCR detection. We use BERT to generate embeddings of the OCR tokens instead of fastText. Since we use bottom-up image feature vectors extracted using a pre-trained Faster-RCNN, we do not use object detection-based textual features. Similarly, since the Faster-RCNN is initialized with ResNet-101 pre-trained for classification, we do not explicitly use classification-based features such as those generated by ResNext.
We explore a copy mechanism in our work to aid copying OCR tokens from the image to the caption. Copy mechanisms have typically been employed in textual sequence-to-sequence learning for tasks such as summarization (See et al., 2017; Gu et al., 2016). They have also been used in image captioning to aid learning novel objects (Yao et al., 2017; Li et al., 2019). In addition, Sidorov et al. (2020) introduced the M4C model, which recognizes text, relates it to its visual context, and decides what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities such as objects.

Dataset
The VizWiz-Captions dataset (Gurari et al., 2020) consists of over 39,000 images taken by people who are blind, each paired with five captions. The dataset consists of 23,431 training images, 7,750 validation images and 8,000 test images. The average length of a caption in the train set and the validation set is 11 words. We refer readers to the VizWiz Dataset Browser (Bhattacharya and Gurari, 2019) as well as the original paper by Gurari et al. (2020) for more details about the dataset.

Approach
We employ AoANet (Huang et al., 2019) as our baseline model. AoANet extends the conventional attention mechanism to account for the relevance of the attention results with respect to the query. An attention module $f_{att}(Q, K, V)$ operates on queries $Q$, keys $K$ and values $V$: it measures the similarities between $Q$ and $K$ and uses the similarity scores to compute a weighted average over $V$:

$$\hat{v}_i = \sum_j \frac{\exp\left(f_{sim}(q_i, k_j)\right)}{\sum_{j'} \exp\left(f_{sim}(q_i, k_{j'})\right)} v_j, \qquad f_{sim}(q_i, k_j) = \frac{q_i^\top k_j}{\sqrt{D}},$$

where $q_i \in Q$ is the $i$-th query, $k_j \in K$ and $v_j \in V$ are the $j$-th key/value pair, $f_{sim}$ is the similarity function, $D$ is the dimension of $q_i$, and $\hat{v}_i$ is the attended vector for query $q_i$.

AoANet introduces an Attention-on-Attention (AoA) module which measures the relevance between the attention result and the query. The AoA module generates an "information vector" $i$ and an "attention gate" $g$, both of which are obtained via separate linear transformations conditioned on the attention result $\hat{v}$ and the query $q$:

$$i = W^i_q q + W^i_v \hat{v} + b^i, \qquad g = \sigma\left(W^g_q q + W^g_v \hat{v} + b^g\right).$$

The AoA module then adds another attention by applying the attention gate to the information vector, obtaining the attended information $\hat{i}$. The AoA module can thus be formulated as:

$$\mathrm{AoA}(f_{att}, Q, K, V) = g \odot i.$$

The AoA module is applied to both the encoder and the decoder. The model is trained by minimizing the cross-entropy loss

$$L_{XE}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y^*_t \mid y^*_{1:t-1}\right),$$

where $y^*_{1:T}$ is the ground-truth sequence. We refer readers to the original work (Huang et al., 2019) for more details. We alter AoANet using the two approaches described next.
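For concreteness, here is a minimal PyTorch sketch of a single-head AoA step under the definitions above. The class and variable names are ours; the original model applies this module inside a multi-head, transformer-style encoder and decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AoA(nn.Module):
    """Single-head Attention-on-Attention sketch (names are ours)."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        # Separate linear transforms produce the information vector i and
        # the attention gate g, conditioned on [query; attention result].
        self.info = nn.Linear(2 * dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, q, k, v):
        # Conventional attention f_att(Q, K, V) with scaled dot-product
        # similarity: f_sim(q_i, k_j) = q_i . k_j / sqrt(D).
        scores = q @ k.transpose(-2, -1) / self.dim ** 0.5
        v_hat = F.softmax(scores, dim=-1) @ v
        qv = torch.cat([q, v_hat], dim=-1)
        i = self.info(qv)                 # information vector
        g = torch.sigmoid(self.gate(qv))  # attention gate
        return g * i                      # attended information i_hat

# Toy usage: 10 queries attending over 36 region features of width 512.
aoa = AoA(512)
q, kv = torch.randn(10, 512), torch.randn(36, 512)
out = aoa(q, kv, kv)  # shape: (10, 512)
```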

Extending Feature Set with OCR Token Embeddings
Our first extension to the model is to enlarge the vocabulary by incorporating OCR tokens. We use an off-the-shelf text detector, Google Cloud Platform's Vision API (Google), to detect text in each image, and filter the extracted tokens against a standard stopwords list¹ as a necessary pre-processing step. We then generate an embedding for each remaining OCR token using a pre-trained base, uncased BERT (Devlin et al., 2019) model. The image and text features are fed together into the AoANet model. We expect the BERT embeddings to help the model direct its attention towards the textual component of the image. Although we also experiment with a pointer-generator mechanism, explained in Section 4.2, we want to leverage the SOTA attention mechanism built into the model and guide it towards using these OCR tokens.
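As a rough sketch of this pipeline, assuming the google-cloud-vision and HuggingFace transformers packages with GCP credentials configured; the function names and the mean-pooling choice are ours:

```python
import torch
from google.cloud import vision
from transformers import BertModel, BertTokenizer

def detect_ocr_tokens(image_path):
    """Detect text in an image with the GCP Vision API
    (assumes GOOGLE_APPLICATION_CREDENTIALS is configured)."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.text_detection(image=image)
    # The first annotation is the full detected block; the rest are tokens.
    return [ann.description for ann in response.text_annotations[1:]]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def embed_ocr_token(token):
    """Embed one OCR token with pre-trained base, uncased BERT,
    mean-pooling over its word pieces (the pooling choice is ours)."""
    inputs = tokenizer(token, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden[0, 1:-1].mean(dim=0)  # drop [CLS]/[SEP], average the rest
```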
Once the OCR tokens were detected, we conducted two experiments with different frequency thresholds. We first set a count threshold of 5, i.e., we only add words to the vocabulary that occur 5 or more times; with this threshold, 4,555 words were added. We then set a count threshold of 2. With such a low threshold, we expect a lot of noise to be present in the OCR token vocabulary: half-detected text, words in a different language, or words that do not make sense. With this threshold, 19,781 words were added. A quantitative analysis of the OCR tokens detected and their frequency is shown in Figure 2.
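The thresholding itself reduces to a frequency count over the detected tokens; a minimal sketch (the helper name is ours):

```python
from collections import Counter

def extend_vocab(ocr_tokens_per_image, stopwords, min_count):
    """Keep OCR tokens that clear the frequency threshold. In our runs,
    min_count=5 added 4,555 words and min_count=2 added 19,781."""
    counts = Counter(
        tok.lower()
        for tokens in ocr_tokens_per_image
        for tok in tokens
        if tok.lower() not in stopwords
    )
    return {tok for tok, freq in counts.items() if freq >= min_count}
```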

Copying OCR Tokens via Pointing
In sequence-to-sequence learning, there is often a need to copy certain segments from the input sequence to the output sequence as they are. This can be useful when sub-sequences such as entity names or dates are involved. Instead of heavily relying on meaning, creating an explicit channel to aid copying of such sub-sequences has been shown to be effective (Gu et al., 2016).
In this approach, in addition to augmenting the input feature set with OCR token embeddings, we employ the pointer-generator mechanism (See et al., 2017) to copy OCR tokens to the caption when needed. The decoder then becomes a hybrid that is able to copy OCR tokens via pointing as well as generate words from the fixed vocabulary. A soft switch is used to choose between the two modes. The switching is dictated by the generation probability $p_{gen}$, calculated at each time step $t$ as follows:

$$p_{gen} = \sigma\left(w_h^\top c_t + w_s^\top h_t + w_x^\top x_t + b_{ptr}\right),$$

where $\sigma$ is the sigmoid function and $w_h$, $w_s$, $w_x$ and $b_{ptr}$ are learnable parameters. $c_t$ is the context vector, $h_t$ is the decoder hidden state and $x_t$ is the input embedding at time $t$ in the decoder. At each step, $p_{gen}$ determines whether a word is generated from the fixed vocabulary or an OCR token is copied using the attention distribution at time $t$. Let the extended vocabulary denote the union of the fixed vocabulary and the OCR words. The probability distribution over the extended vocabulary is given as:

$$P(w) = p_{gen} P_{vocab}(w) + \left(1 - p_{gen}\right) \sum_{i: w_i = w} a^t_i,$$

where $P_{vocab}$ is the probability of $w$ under the fixed vocabulary and $a^t$ is the attention distribution. If $w$ does not appear in the fixed vocabulary, then $P_{vocab}(w)$ is zero; if $w$ is not an OCR word, then $\sum_{i: w_i = w} a^t_i$ is zero.
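A minimal sketch of one decoding step of this soft switch, assuming each detected OCR token has already been assigned an id in the extended vocabulary; the layer shapes and names are ours:

```python
import torch
import torch.nn as nn

class PointerGeneratorHead(nn.Module):
    """Soft switch between generating from the fixed vocabulary and
    copying an OCR token (layer shapes and names are ours)."""

    def __init__(self, dim, vocab_size):
        super().__init__()
        self.vocab_proj = nn.Linear(dim, vocab_size)
        # w_h, w_s, w_x and b_ptr rolled into one layer over [c_t; h_t; x_t].
        self.p_gen_proj = nn.Linear(3 * dim, 1)

    def forward(self, c_t, h_t, x_t, attn, ocr_ids, extended_size):
        # p_gen = sigmoid(w_h . c_t + w_s . h_t + w_x . x_t + b_ptr)
        p_gen = torch.sigmoid(self.p_gen_proj(torch.cat([c_t, h_t, x_t], -1)))
        p_vocab = torch.softmax(self.vocab_proj(h_t), dim=-1)
        # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i
        dist = torch.zeros(h_t.size(0), extended_size)
        dist[:, : p_vocab.size(-1)] = p_gen * p_vocab
        # ocr_ids maps each attended OCR token to its extended-vocabulary id.
        dist.scatter_add_(1, ocr_ids, (1.0 - p_gen) * attn)
        return dist
```

If a word appears both in the fixed vocabulary and among the OCR tokens, the two probability masses simply add, matching the formulation above.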

Experiments
In our experiments, we alter AoANet as per the approaches described in Section 4 and compare these with the baseline model. AoANet-E refers to AoANet altered as per the approach described in Section 4.1. To observe the impact of the number of OCR words added to the extended vocabulary, we train two Extended variants: (1) E5: only OCR words with frequency greater than or equal to 5; (2) E2: only OCR words with frequency greater than or equal to 2. AoANet-P refers to AoANet altered as per the approach described in Section 4.2; its extended vocabulary consists of OCR words that occur with frequency greater than or equal to 2. We use the code² released by the authors of AoANet to train the model. We cloned the repository and made changes to extend the feature set and the vocabulary using OCR tokens as well as to incorporate the copy mechanism during decoding³. We train our models on a Google Cloud VM instance with one Tesla K80 GPU. Like the original work, we use a Faster-RCNN (Ren et al., 2015) model pre-trained on ImageNet (Deng et al., 2009) and Visual Genome (Krishna et al., 2017) to extract bottom-up feature vectors of images. The OCR token embeddings are extracted using a pre-trained base, uncased BERT model. The AoANet models are trained using the Adam optimizer with a learning rate of 2e-5, annealed by a factor of 0.8 every 3 epochs as recommended in Huang et al. (2019). The baseline AoANet is trained for 10 epochs while AoANet-E and AoANet-P are trained for 15 epochs.
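The optimizer schedule maps directly onto a standard step scheduler; a sketch (the model here is a stand-in placeholder):

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# The schedule described above: Adam with lr 2e-5, annealed by a factor
# of 0.8 every 3 epochs.
model = torch.nn.Linear(1024, 512)  # placeholder for the captioning model
optimizer = Adam(model.parameters(), lr=2e-5)
scheduler = StepLR(optimizer, step_size=3, gamma=0.8)

for epoch in range(15):  # AoANet-E / AoANet-P train for 15 epochs
    # ... one epoch of cross-entropy training would go here ...
    scheduler.step()
```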

Results
We show quantitative metrics for each of the models that we experimented with in Table 1, and qualitative results comparing captions generated by the different models in Table 2. Note that none of our models were pre-trained on the MS-COCO dataset, unlike the experiments of Gurari et al. (2020).
Comparing the different models, we find that merely extending the vocabulary helps improve model performance on the dataset. AoANet-E5 matches the validation scores of AoANet while improving the CIDEr score. Moreover, we see a large improvement in validation and test CIDEr scores for AoANet-E2, along with gains on the other metrics. This shows that the BERT embeddings generated for the OCR tokens provide important context for the task of generating captions. AoANet-P, which uses the pointer-generator to copy OCR tokens after extending the vocabulary, also performs better than our baseline AoANet model, showing that an OCR copy mechanism is an essential component of caption generation. Intuitively, this makes sense: we would expect humans to use these words when writing detailed captions themselves. Looking ahead, we feel that top-k sampling is a worthwhile direction, especially when some variety in the captions is desired. Beam search is prone to preferring shorter captions, as the probability of a longer caption accumulates smaller values, as discussed by Holtzman et al. (2019).
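As a sketch of the decoding alternative we have in mind, here is a single top-k sampling step over the decoder's output logits; the value of k is arbitrary:

```python
import torch

def top_k_sample(logits, k=10):
    """One top-k sampling step: keep the k most likely tokens,
    renormalize among them, and sample."""
    topk = torch.topk(logits, k, dim=-1)
    probs = torch.softmax(topk.values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk.indices.gather(-1, choice)

# Toy usage over a batch of decoder logits for a 9,000-word vocabulary.
next_token = top_k_sample(torch.randn(4, 9000))  # shape: (4, 1)
```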

Error Analysis
Although there have been concerns about the robustness of the GCP API to noise (Hosseini et al., 2017), we focused our attention on the model's captioning performance and on the pointer-generator mechanism. The API's performance might hinder the quality of the captions generated, but we did not expect it to have a large impact.
We first look at how the Extended variants compare with the baseline. We observe that adding text-based features to the feature set imparts useful information to the model. In image 2a, AoANet perceives the card as a box of food; the addition of text features enables AoANet-E5 to perceive it as a box with black text. While not entirely correct, this is an improvement over the baseline. The alteration also encourages the model to be more specific. When the model is unable to find the token that entails specificity, it resorts to producing UNK. Extending the vocabulary to accommodate more OCR words helps address this problem. In image 2b, the baseline AoANet is unable to recognize that the bottle is a supplements bottle. AoANet-E5 attempts to be specific but, since 'dietary' and 'supplement' are not present in its extended vocabulary, it outputs UNK. AoANet-E2 outputs a much better caption. We see a similar pattern in image 2c.
We now look at how the Pointer variant performs compared to the baseline and the Extended variants. Incorporating the copy mechanism helps the Pointer variant copy OCR tokens into the caption: AoANet-P is able to copy over 'oats' and 'almonds' in image 2d and the token 'rewards' in image 2e. However, the model is prone to copying tokens multiple times, as seen in images 2b and 2f. This repetition is a common problem in sequence-to-sequence models (Tu et al., 2016) as well as in pointer-generator networks. A coverage mechanism (Tu et al., 2016; See et al., 2017) is typically used to handle this, and we wish to explore it in the future.

Conclusion
In this work, we propose a pointer-generator based image captioning model that deals specifically with images taken by people with visual disabilities. Our alteration of AoANet shows significant improvement on the VizWiz dataset compared to the baseline. As stated in Section 7, we would like to explore a coverage mechanism in the future. Dognin et al. (2020) recently discussed their winning entry to the VizWiz Grand Challenge, and Sidorov et al. (2020) introduced a model that has been shown to gain significant performance improvements by using OCR tokens. We intend to compare our model with these and improve our work based on the observations made.

Acknowledgements
The authors would like to thank Mohit Iyyer and Kalpesh Krishna at the University of Massachusetts, Amherst for their invaluable guidance and suggestions. The authors would also like to thank the University of Massachusetts, Amherst for providing the necessary resources throughout the course of this project.

(e) AoANet: a hand holding a box of chocolate's brand. AoANet-E5: a person is holding a package of food. AoANet-E2: a hand holding a card with a number on it. AoANet-P: a person is holding a box of rewards card. GT1: Appears to be a picture of a reward card. GT2: A plastic card that says speedy rewards membership card. GT3: A Speedy Rewards membership card with a large gold star displayed on it. GT4: A human holds some cards like credit cards and reward cards. GT5: Rewards membership card from the Speedway chain of stores.

Table 2: Examples of captions generated by AoANet, AoANet-E5 (extended-vocabulary variant with OCR frequency threshold of 5), AoANet-E2 (extended-vocabulary variant with OCR frequency threshold of 2) and AoANet-P (pointer-generator variant) for validation-set images, along with their respective ground-truth captions.