STAIR: Learning Sparse Text and Image Representation in Grounded Tokens

Image and text retrieval is one of the foundational tasks in the vision and language domain with multiple real-world applications. State-of-the-art approaches, e.g. CLIP, ALIGN, represent images and texts as dense embeddings and calculate the similarity in the dense embedding space as the matching score. On the other hand, sparse semantic features like bag-of-words models are more interpretable, but are believed to suffer from inferior accuracy compared to dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space. Each token in the space is a (sub-)word in the vocabulary, which is not only interpretable but also easy to integrate with existing information retrieval systems. The STAIR model significantly outperforms a CLIP model, with +$4.9\%$ and +$4.3\%$ absolute Recall@1 improvements on COCO-5K text$\rightarrow$image and image$\rightarrow$text retrieval respectively. It also achieves better performance than CLIP on both ImageNet zero-shot classification and linear probing.

Preprint.

Figure 1. Learning a sparse and interpretable embedding for a vision-language model. CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) learn a dense D-dimensional embedding. In contrast, the goal of STAIR is to learn a sparse and interpretable embedding in a high-dimensional space. Each dimension in the sparse embedding is a (sub-)word in a large dictionary, and the predicted non-negative scalar corresponds to the weight associated with that token.

Recent advances in vision-language modeling have been fueled by a renewed interest in contrastive learning. State-of-the-art contrastive learning models such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) use image- and text-specific encoders to embed images and text in a joint embedding space. The model is trained to increase the cosine similarity of aligned image-text pairs and decrease that of unmatched ones. Such contrastive learning models achieve state-of-the-art fine-tuning performance as well as strong zero-shot generalization results on image-text retrieval, VQA, and image classification.
Despite the impressive benchmark performance, the dense embedding space is usually considered a black box; it is hard and unintuitive to interpret the meaning of the embeddings. Furthermore, as a model trained with a retrieval (contrastive) objective, deploying it in a real-world image-text retrieval product is challenging. Although approximate nearest neighbor search (Guo et al., 2020; Johnson et al., 2019) can be used to retrieve from a dense embedding space, the cost can be high when scaling up to billions of images. Lastly, because the embeddings live in a dense space, traditional approaches developed for search engines, such as the inverted index, cannot be deployed naively. The difficulty of interoperability of the dense embeddings also makes it challenging to combine them with other retrieval features without additional training.

arXiv:2301.13081v1 [cs.CV] 30 Jan 2023
The computer vision field has made many efforts to explore sparse, interpretable representations of images. Bag-of-words and topic models (Csurka et al., 2004; Sivic et al., 2005; Fei-Fei & Perona, 2005; Lazebnik et al., 2006) were widely used together with the SIFT descriptor (Lowe, 2004), but researchers later found their performance inferior to that of dense vectors (Lin et al., 2011; Sánchez et al., 2013). Another effort, in the deep CNN era, developed the deep visual-semantic embedding (Frome et al., 2013) using ImageNet topics, which is more interpretable but still fails to outperform dense representations (Faghri et al., 2017).
We hypothesize that the gap between sparse semantic and dense embeddings stems from at least two factors: (1) Previous works on semantic embedding do not effectively explore large-scale training to model the rich semantics of the image-text space.
(2) Most existing semantic embedding methods are built on a fixed vocabulary (e.g., several thousand concepts), which cannot handle out-of-vocabulary concepts. In this paper, we present a new model, named STAIR, and a multi-stage training recipe to learn a Sparse Text And Image Representation to tackle the aforementioned challenges. We will show that our sparse representation can not only match but also outperform state-of-the-art image-text representations. Inspired by the recent success of sparse representations in the information retrieval field (Bai et al., 2020; Formal et al., 2021b), STAIR encodes the image and text into a sparse and grounded token space, as illustrated in Figure 1. Concretely, images and text are mapped to sparse embeddings that are grounded to actual (sub-)words with non-negative weights. The sparse embedding makes it straightforward to 1) interpret and understand the model's predictions; 2) conduct large-scale retrieval using an efficient inverted-index approach; and 3) combine with other text features by simply biasing the token weights in the sparse token space. The proposed multi-stage training is critical to the grounding of the model.
We compare the STAIR model and a CLIP model trained with the same model architecture and dataset. Experimental results show the STAIR model significantly outperforms the CLIP model on image-text retrieval tasks, with +4.9% and +4.3% recall@1 on COCO-5K text→image and image→text retrieval respectively. STAIR models also achieve similar or better performance on zero-shot classification and linear probing tasks, including ImageNet.
Furthermore, our sparse embedding is easier for humans to interpret. We define an interpretable space using the BERT vocabulary to quantify model interpretability. Experiments show STAIR is significantly more interpretable, with a Top-1 accuracy of 32.9% (STAIR) vs. 13.7% (CLIP) on ImageNet.

Approach
We begin with a review of a typical dual-encoder architecture to motivate our research.

Dual-Encoder Contrastive Learning
Given a dataset of image-text pairs $\mathcal{D} = \{(x_i, y_i)\}$, where $x_i$ and $y_i$ represent the image and text respectively, a dual encoder learns a similarity $M(x, y)$ so that an aligned pair $(x_i, y_i)$ has a higher similarity score than unmatched pairs sampled from a negative sample set $\mathcal{D}_i$.
A dual-encoder architecture consists of an image encoder $E_{\text{IMAGE}}$ and a text encoder $E_{\text{TEXT}}$. Each encoder $E$ has two components: 1) a standard, input-specific neural network $f(\cdot)$ and 2) a projection head $g(\cdot)$ that maps the features into a joint dense embedding space:

$$E(\cdot) = g(f(\cdot))$$

State-of-the-art approaches usually instantiate $f(\cdot)$ as a transformer model and $g(\cdot)$ as a pooling layer.
Given an image $x$ and text $y$ as inputs, their similarity is measured by the cosine similarity:

$$\mathrm{sim}(x, y) = \frac{\langle E_{\text{IMAGE}}(x),\, E_{\text{TEXT}}(y)\rangle}{\|E_{\text{IMAGE}}(x)\| \cdot \|E_{\text{TEXT}}(y)\|}$$

It is used to train the dual encoder by minimizing the contrastive loss $L_{\text{CON}}$:

$$L_{\text{CON}} = -\frac{1}{|\mathcal{D}|}\sum_{i}\log\frac{\exp(\mathrm{sim}(x_i, y_i)/T)}{\sum_{(x, y)\in\hat{\mathcal{D}}_i}\exp(\mathrm{sim}(x, y)/T)}$$

where $\hat{\mathcal{D}}_i = \mathcal{D}_i \cup \{(x_i, y_i)\}$ denotes the positive pair together with its negative set for each example, and $T$ is the softmax temperature.
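The symmetric in-batch contrastive objective above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation; the function names and the temperature default are our own choices, and real systems would compute this on batched tensors.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two vectors represented as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """In-batch InfoNCE loss: for image i, text i is the positive and
    every other text in the batch acts as a negative sample."""
    n = len(img_embs)
    loss = 0.0
    for i in range(n):
        logits = [cosine_sim(img_embs[i], t) / temperature for t in txt_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)  # -log softmax of the positive pair
    return loss / n
```

Aligned pairs should yield a lower loss than shuffled pairs, which is exactly the signal that trains the dual encoder.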

STAIR
Following CLIP (Radford et al., 2021), STAIR employs a dual-encoder architecture. As illustrated in Figure 2, it contains a pair of image and text encoders with a sequential feature extraction network $f(\cdot)$ and projection head $g(\cdot)$. In particular, the dense projection head $g(\cdot)$ is replaced with a Token Projection Head, which maps the representation to a sparse embedding space. A vocabulary $V$ is used as the basis of the embedding space for the purpose of interpretability.
The token projection head $g(\cdot)$ consists of two components: 1) a mapping function that maps the input sequence $h$ to a sequence of weights over each token $j$ in the vocabulary space $V$, and 2) a pooling layer that summarizes the sequence as a sparse embedding in the vocabulary space $V$.

Figure 2. Diagram of the STAIR architecture. It employs a dual-encoder architecture. Different from dense models like CLIP or ALIGN, STAIR maps the dense embeddings to a sparse latent embedding space via a token prediction head. In addition to a regular contrastive loss that minimizes an image-text matching loss, a FLOPs (Paria et al., 2020) loss is added to encourage sparsity of the image and text embeddings.

We reuse the BERT (Kenton & Toutanova, 2019) masked language model (MLM) prediction head $p(\cdot)$ as the mapping function:

$$p(h_j) = e \cdot \mathrm{TRANSFORM}(h_j) + b$$

where $h_j$ corresponds to the $j$th token in the output sequence of the feature extraction network $f(\cdot)$, the TRANSFORM function consists of an FC layer with GELU activation and a LAYER NORM layer, and $e$ and $b$ are the linear mapping and bias in the MLM prediction head that map the transformed embedding to the vocabulary space. We tie the weights of the linear mapping layer $e$ with the text tower's token embedding table. We believe this step is important for associating the sparse embedding with actual text tokens in $V$.
Following (Formal et al., 2021b;a), we aggregate the weight of token $j$ across the sequence to form the sparse embedding ENC:

$$\mathrm{ENC} = \log\big(1 + \max_j \mathrm{ReLU}(p(h_j))\big)$$

The ReLU activation ensures that each token weight is non-negative, and adding the log function empirically achieves better performance by suppressing overwhelmingly large weights (Zhao et al., 2021). After the token projection head, the image embedding $\mathrm{ENC}_{\text{IMAGE}}$ and text embedding $\mathrm{ENC}_{\text{TEXT}}$ are sparse vectors living in a $|V|$-dimensional space defined by the vocabulary.
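A minimal NumPy sketch of the pooling step: it assumes the TRANSFORM step (FC + GELU + LayerNorm) has already been applied to the token features, so only the tied linear projection, ReLU, max-pooling over sequence positions, and the log(1 + x) squashing are shown. All array names here are illustrative.

```python
import numpy as np

def token_projection_head(h, e, b):
    """Sketch of the STAIR token projection head.
    h: (seq_len, dim) already-transformed token features
    e: (vocab, dim) tied token embedding table
    b: (vocab,) MLM prediction head bias
    Returns a |V|-dimensional non-negative sparse embedding."""
    logits = h @ e.T + b                                # (seq_len, vocab) per-position weights
    pooled = np.max(np.maximum(logits, 0.0), axis=0)    # ReLU, then max-pool over positions
    return np.log1p(pooled)                             # log(1 + x) suppresses large weights
```

Because of the ReLU, any token whose weight is non-positive at every position stays exactly zero, which is what makes the output sparse.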
To further encourage sparsity, we follow (Paria et al., 2020) and introduce the FLOPs regularization loss so that only a small number of token entries in $V$ are non-zero:

$$L_{\text{FLOPs}} = \sum_{k \in V}\Big(\frac{1}{N}\sum_{i=1}^{N} \mathrm{ENC}_{k}^{(i)}\Big)^{2}$$

Combining the contrastive loss and the FLOPs regularization losses, the STAIR model is optimized by:

$$L = L_{\text{CON}} + \lambda_1 L_{\text{FLOPs}}^{\text{IMAGE}} + \lambda_2 L_{\text{FLOPs}}^{\text{TEXT}}$$

where $\lambda_1$ and $\lambda_2$ are the FLOPs regularization weights for the image and text embeddings respectively.
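The FLOPs regularizer of Paria et al. (2020) can be sketched in a few lines: it is the sum, over vocabulary entries, of the squared mean activation across a batch, which pushes the average use of every token toward zero. This is an illustrative sketch; the function name is ours.

```python
import numpy as np

def flops_loss(batch_embs):
    """FLOPs regularizer: sum over vocabulary entries of the squared
    mean activation across the batch. batch_embs: (batch, vocab)."""
    mean_per_token = np.mean(batch_embs, axis=0)   # average weight of each token
    return float(np.sum(mean_per_token ** 2))
```

Note that a batch whose activation mass is spread over few tokens per example but different tokens across examples incurs a smaller penalty than a batch that activates every token uniformly, so minimizing this loss encourages sparse, diverse embeddings.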

Training Details
As noted in Section 2.2, we expect the STAIR model to achieve two goals: 1) aligning text and images in the sparse embedding space; 2) grounding each sparse embedding dimension to a human-understandable (sub-)word in the vocabulary. In other words, the image and text embeddings should eventually reside in a space spanned by basis vectors grounded to human-interpretable tokens. However, in practice, we found that simply replacing the dense projection head with a token prediction head does not guarantee the 2nd goal out of the box. This is because images and text are two very different modalities with a semantic gap between them, while the contrastive loss only encourages text/image alignment. As shown in Section 5.2, the model learns a shortcut of reusing less common tokens to bridge the gap across modalities. More details can be found in Appendix A. To address this issue, we propose a multi-stage training approach that sequentially bridges the discrepancy between the sparse embedding space and the interpretable space of the vocabulary. The main steps are illustrated in Figure 3.
Stage 1: Training image embedding with masked tokens. In the first stage, we co-train both the image and text encoders and apply a binary mask on the text embedding, as illustrated in Figure 3. Formally, given the original text input $y$, the masked text embedding is formulated as:

$$\mathrm{ENC}_{\text{TEXT}}^{\text{masked}}(y) = \mathrm{MASK}(y) \odot g(f(y))$$

where $\mathrm{MASK}_i = 1$ if the $i$th token exists in the input sentence $y$ after tokenization, and $g(f(y))$ predicts the weights of the non-masked tokens. In this setting, the text embedding is forced to activate only the tokens appearing in the original text input, while ignoring all others. By matching against the masked text embedding, the image encoder learns to ground its image embedding on the tokens of the pairing text. Therefore, after stage 1 training, the image embedding lives in the vocabulary's interpretable space.
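The stage-1 masking step amounts to zeroing every vocabulary dimension whose token is absent from the tokenized input. A minimal sketch, with illustrative names (real code would operate on batched tensors and a 30,522-entry vocabulary):

```python
def masked_text_embedding(token_ids, weights):
    """Stage-1 sketch: keep the predicted weight only for tokens that
    appear in the tokenized input sentence; zero out all others.
    token_ids: ids of tokens present in the input text
    weights: |V|-dimensional predictions g(f(y)) from the text tower."""
    present = set(token_ids)
    return [w if j in present else 0.0 for j, w in enumerate(weights)]
```

Only the dimensions of tokens actually present in the caption can carry weight, which forces the paired image embedding to land on those same interpretable dimensions.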
Stage 2: Training with frozen image encoder. In this stage we focus on grounding the text embedding in the same interpretable space where the image embedding was trained to reside in stage 1. The key idea is to let the image encoder act as a teacher for the text encoder. As shown in Figure 3, we freeze the image encoder while training the text encoder to match the image embedding via the contrastive loss. After stage 2 training, both image and text embeddings are in the same human-interpretable embedding space constructed by the vocabulary.
Stage 3: Fine-tuning both encoders. The first two stages provide a good foundation for both encoders to produce human-interpretable embeddings. However, because the image and text encoders are tuned in a round-robin fashion, the parameters are not optimal for image-text matching. In stage 3, we boost image-text matching performance by fine-tuning both encoders jointly.

Datasets
Our dataset is a combination of internal and public datasets with 1.1B image-text pairs in total. The public datasets consist of Conceptual Captions 3M (CC-3M) (Sharma et al., 2018) and Conceptual Captions 12M (CC-12M) (Changpinyo et al., 2021). The internal image-text dataset consists of 1B image-text pairs, including a 134M clean licensed dataset (details in Appendix B) along with a 971M noisy web-crawled dataset. The web-crawled dataset is mined using an approach similar to that described in ALIGN (Jia et al., 2021) and CLIP (Radford et al., 2021). We further filter the data with a public CLIP model 1 , removing image-text pairs with a similarity score < 0.2.

Configurations
In the experiments, we train a CLIP model and a STAIR model for comparison. Both models adopt the transformer (Vaswani et al., 2017) as the backbone with a modified CLIP-B/16 (Radford et al., 2021) configuration. The text encoder is a 12-layer transformer with a hidden dimension of 512 and 8 attention heads. The text input is tokenized by the BERT WordPiece tokenizer (Kenton & Toutanova, 2019) with a 30,522-token vocabulary. The maximum input sequence length is set to 76. The image encoder is a 12-layer transformer with 12 attention heads and a hidden dimension of 768.
The CLIP model is trained for 600K steps using the LAMB optimizer (You et al., 2020) with a learning rate of 1e-3 and a weight decay ratio of 1e-2. The learning rate is warmed up over the first 10K steps and then linearly decayed to zero. The STAIR model takes 300K, 300K, and 600K steps for stages 1, 2, and 3 respectively, where each stage uses the same configuration as above. To avoid catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999), a smaller maximum learning rate of 1e-4 is adopted in stage 3. FLOPs regularization weights are set to λ1 = λ2 = 1e-3 by default, with quadratic growth following (Formal et al., 2021a). All models are trained with a global batch size of 16,384 unless specified otherwise.

Zero-Shot Text Image Retrieval

Table 1 shows the recall@K (K=1, 5, 10) performance on the image/text retrieval tasks of Flickr30K (Plummer et al., 2015) and COCO-5K (Chen et al., 2015). The metrics are reported with a prompt of "a photo of" added before the original caption, following Radford et al. (2021). We observe substantial improvements of the STAIR model over the CLIP baseline, i.e. 4.9% and 4.3% improvements in recall@1 on COCO-5K text→image and image→text retrieval respectively. A similar improvement is observed on Flickr30K.
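For reference, the recall@K metric used throughout the retrieval evaluations can be sketched as follows. This is a generic illustration (not the authors' evaluation code), assuming the ground-truth match for query i is candidate i:

```python
def recall_at_k(sim_matrix, k):
    """sim_matrix[i][j]: similarity of query i to candidate j, where
    the ground-truth match for query i is candidate i. Returns the
    fraction of queries whose true match ranks in the top-k."""
    hits = 0
    for i, row in enumerate(sim_matrix):
        top_k = sorted(range(len(row)), key=lambda j: -row[j])[:k]
        hits += i in top_k
    return hits / len(sim_matrix)
```

The reported "+4.9% recall@1" improvements are absolute differences of this quantity between STAIR and the CLIP baseline.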

Zero-Shot Visual Classification
In Table 3, we evaluate the models on zero-shot image classification on 9 datasets from Radford et al. (2021). We report the top-1 accuracy and employ the same prompt sets from Radford et al. (2021). The results suggest that STAIR performs better than or as competitively as CLIP on most datasets. In particular, STAIR shows significantly better performance on SVHN classification, which requires an exact match, thanks to its token grounding capability. On the other hand, we notice that STAIR also suffers on specialized out-of-distribution tasks, such as EuroSAT (Helber et al., 2019) and RESISC45 (Cheng et al., 2017), similar to CLIP, as mentioned by Radford et al. (2021).

Linear Probe of Visual Classification
We also compare linear probe performance on the same 9 datasets as in zero-shot classification. We observe that the STAIR model shows stronger results than CLIP, even on the EuroSAT and RESISC45 datasets, where it showed weaker zero-shot performance. This is because the STAIR model benefits from its embedding dimensionality of 30,522, which is much larger than CLIP's 512. Moreover, with sparsity enforced in STAIR, there is no extra storage or computation cost compared to the dense embeddings (more discussion in Section 6.3).

Quantitative Analysis
A key difficulty of interpreting a high-dimensional embedding is that its representation residing in the embedding space $\mathbb{R}^d$ does not naturally correspond to concepts that humans easily understand. To address this challenge, Kim et al. (2018) proposed to leverage an interpretable space $\mathbb{R}^H$ spanned by vectors $c$ corresponding to an unknown set of human-interpretable concepts. From this standpoint, embedding interpretation can be viewed as a mapping $\mathbb{R}^d \to \mathbb{R}^H$.
For image-text dual encoder models, a good interpretable space to connect the modalities is a lexicon space spanned by basis vectors $c$ of text embeddings representing human-understandable tokens, words, phrases, and/or sentences. The interpretation can then be given by the similarity between the embedding and each basis vector, $\mathrm{Sim}(\cdot, c)$. The lexicon space is crucial for comparing the interpretation between different models: given the embedding of a dog image, without a dog concept existing in the lexicon space, it is infeasible to understand the content from the image embedding itself.

Figure 5. Top predicted tokens for example COCO captions. Tokens from STAIR are more friendly for humans to interpret the visual concepts than those from CLIP. ## indicates a subword from the vocab.
Zero-shot image classification is a restricted form of functionally-grounded interpretability evaluation (Doshi-Velez & Kim, 2017), as its interpretable space is predefined by its classes and is dataset-specific. In practice, the interpretable space can both lack human labels and be unlimited (Ghorbani et al., 2019). To lift this constraint, we expand our interpretable space to the vocabulary of the BERT WordPiece tokenizer (Kenton & Toutanova, 2019) ($n = 30{,}522$) to approximate a lexicon space covering any human-interpretable concept. Note that, under this definition, the embedding space becomes identical to the interpretable space for STAIR, and the interpretation mapping reduces to an identity function. Similar to zero-shot image classification, if an image embedding lies closer to its ground-truth class in the lexicon space, we consider it easier to interpret. This task is usually more challenging than the original zero-shot classification because the candidate classes now comprise the entire vocabulary, which is much larger than the predefined class set.
We compare the interpretability of CLIP and STAIR on three datasets: ImageNet, CIFAR-100, and Caltech-101. Top-K accuracy is used to quantitatively measure the models' interpretability. In particular, if a ground-truth class is tokenized into separate sub-words $c_k$ in the vocabulary, we take the maximum similarity over the sub-words, $\max_k \mathrm{Sim}(\cdot, c_k)$, as the final similarity. As shown in Table 4, it is clear that STAIR provides significantly better interpretability than the CLIP model in the interpretable space.
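The sub-word max-scoring rule above can be sketched as follows. Function names are illustrative; for STAIR the embedding already lives in vocabulary space, so Sim(·, c_k) reduces to reading off the weight at sub-word k's dimension:

```python
def class_similarity(embedding, class_subword_ids):
    """Score a class whose name tokenizes into several sub-words by
    the maximum similarity over its sub-word dimensions. For STAIR,
    embedding is the |V|-dim sparse vector and Sim(., c_k) is simply
    the weight at sub-word k's dimension."""
    return max(embedding[k] for k in class_subword_ids)

def top1_class(embedding, classes):
    """classes: dict mapping class name -> list of sub-word ids.
    Returns the class name with the highest sub-word-max similarity."""
    return max(classes, key=lambda name: class_similarity(embedding, classes[name]))
```

Top-1 accuracy in this setting counts how often the ground-truth class wins this comparison against every other candidate.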
Figure 6. Selected examples of top predictions and their weights from STAIR models. The predicted tokens from STAIR SINGLE-STAGE are not connected to the actual meaning of the inputs, while STAIR predictions from multi-stage training are grounded to the input semantics. ## indicates a subword from the vocab.

Qualitative Analysis
Here we qualitatively examine whether the sparse embedding from the STAIR model is more interpretable than CLIP's. In Figure 5, we report the top $\mathrm{Sim}(\cdot, c_k)$ (sub-)words in the interpretable space defined by the BERT WordPiece tokenizer for each image. We see that STAIR is better at capturing visual concepts that humans can easily understand than CLIP, which is consistent with our quantitative analysis. We also observe that the top tokens from CLIP tend to have similar matching scores, while STAIR avoids this problem by adopting eq. (5) in the Token Projection Head. Figure 4 shows more examples of (sub-)words with the highest weights from STAIR embeddings given each image. The results suggest that STAIR is capable of grounding images to tokens that are semantically related. For example, it can infer "wedding" from a picture of a bride and groom cutting a cake. In practice, this grounding and interpretability is very useful for debugging and understanding model behavior. For example, we observed that the model favors activating the "https" token in many predicted embeddings. We found that this bias is mainly caused by a large portion of web-mined content in our training data, where "https" occurs in many of the associated texts.

Necessity of Multi-stage training
The multi-stage training strategy is crucial to guarantee that the STAIR embedding grounds to meaningful tokens. To demonstrate its necessity, we train a single-stage model, denoted STAIR SINGLE-STAGE, for comparison. Figure 6 shows the predictions from the STAIR SINGLE-STAGE and multi-stage STAIR models separately. We observe that STAIR SINGLE-STAGE tends to redefine the semantic meaning of tokens and reuse them to match images and text. For example, STAIR SINGLE-STAGE does not interpret the tokens "eminent" and "nottingham" by their usual meanings (fame and a location) but redefines them as teddy-bear topics, as they are always the top two predicted tokens for teddy bear images. Similarly, "clue" is re-interpreted as the concept of skiing. Although we can guess and infer the new semantic meaning of each token through reverse engineering, the redefined meaning is far from the original one. This makes it hard for humans to interpret the predicted embeddings. In contrast, with multi-stage training, tokens are grounded to their original meaning and the embedding is human-readable 2 .

Figure 7. Ablation on the multi-stage training strategy. Stage 1 already lays a reasonable foundation for zero-shot performance. Stage 2 helps more on ImageNet than stage 1. Stage 3 further improves both ImageNet and COCO/Flickr30K.

Ablation on Multi-stage training
In this section, we quantitatively study the effect of the multi-stage training strategy on zero-shot transfer. Figure 7 shows the top-1 accuracy on ImageNet classification and recall@1 on COCO-5K and Flickr30K image-text retrieval for each stage separately. We observe that stage 1 already achieves reasonable performance even though the activation of text tokens is restricted, while stage 2 helps more on the classification task. Furthermore, stage 3 greatly boosts the text-image matching ability of the text and image encoders together via contrastive learning, which is reflected in all metrics.

Ablation on Sparsity
The embedding sparsity is essential to guarantee the efficiency of similarity computation and retrieval speed. In STAIR, the sparsity strength is controlled by the FLOPs regularization weights. To study their impact, we train three STAIR models with regularization weights $\lambda = \lambda_1 = \lambda_2 \in \{1e^{-2}, 1e^{-3}, 1e^{-4}\}$. We check the corresponding text and image embedding sparsity, i.e. the number of tokens activated in the predictions, as well as the zero-shot transfer performance. The datasets we use are ImageNet, COCO-5K, and Flickr30K, and the results are summarized in Figure 8.

Figure 8. Number of tokens activated and performance on zero-shot transfer. Subscripts t, i indicate results for text or image. The horizontal line represents the dense CLIP baseline with an embedding size of 512. The effective text embedding size from STAIR is significantly lower than CLIP's, which helps speed up retrieval; more sparsity in the embedding lowers zero-shot transfer performance.
The results suggest that the regularization weights have a positive impact on encouraging sparsity. Importantly, the effective number of tokens from STAIR is significantly lower than the dense embedding dimension of 512 used in CLIP for the text embedding, and comparable for the image embeddings, in all 3 settings. Since the complexity of a sparse-embedding dot product is linear in the smaller number of non-zero entries of the two embeddings, STAIR models are more efficient at similarity computation during retrieval than CLIP. Moreover, we observe that more tokens are activated in the image embeddings than in the text embeddings. One explanation is that image semantics are usually broader and more general while text meaning is more specific, as reflected in the sparsity of the embeddings. We also notice that when $\lambda$ is large, the text embedding tends to be more sparse for shorter text inputs. On the other hand, the regularization weights have a negative impact on zero-shot performance. In particular, the trend is more noticeable in retrieval tasks than in image classification.
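The efficiency claim can be made concrete: storing a sparse embedding as a token-to-weight map makes the dot product's cost proportional to the number of non-zero entries of the smaller embedding rather than to |V| = 30,522. A minimal sketch with illustrative names:

```python
def sparse_dot(a, b):
    """Similarity between two sparse embeddings stored as
    {token_id: weight} dicts. Iterates over the smaller embedding,
    so cost is linear in its number of non-zero entries rather
    than in the full vocabulary size."""
    if len(b) < len(a):
        a, b = b, a
    return sum(w * b[t] for t, w in a.items() if t in b)
```

With a text embedding activating a few dozen tokens, each query-image score touches only those dimensions, which is how the effective embedding size below 512 translates into faster retrieval.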

Text Encoder Free Applications
The token grounding capability also gives the STAIR model the possibility of tackling existing tasks more efficiently. We illustrate two potential applications: 1) text-encoder-free retrieval, and 2) text-encoder-free image-text localization.

Text Encoder Free Image-Text Retrieval
Another particular advantage of STAIR is that it enables the possibility of a text-encoder-free retrieval system. We compare the dual-encoder and text-encoder-free architectures in Figure 9. More concretely, we use the image encoder of STAIR to generate the sparse image embeddings, while texts are directly converted into binary MASK vectors in the vocabulary space after tokenization to serve as the sparse text embeddings. In contrast to the dual-encoder setup that requires inference for both the image and text encoders, this architecture needs only the former, which makes it friendly for applications with strict latency requirements. On the other hand, it is still different from a retrieval system built on a fixed taxonomy with an inverted index, because it is capable of taking arbitrary free-text inputs.
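A sketch of this text-encoder-free query path, under the assumption (stated above) that the query embedding is simply the binary MASK of the tokenized text; function names are ours:

```python
def text_free_query_embedding(token_ids, vocab_size):
    """Text-encoder-free sketch: the query embedding is the binary
    MASK of the tokenized text -- 1 for tokens present in the query,
    0 elsewhere. No text-tower inference is needed."""
    emb = [0.0] * vocab_size
    for t in set(token_ids):
        emb[t] = 1.0
    return emb

def score(image_emb, query_token_ids):
    """With a binary query, the dot product reduces to summing the
    image embedding's weights on the query's tokens."""
    return sum(image_emb[t] for t in set(query_token_ids))
```

Because every query token gets weight 1, stop words count as much as content words, which is the limitation the text notes below for caption-style queries.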
In Table 5, we summarize the zero-shot performance of text-encoder-free STAIR compared with the original STAIR.
Although it clearly underperforms the dual-encoder baseline, the results are still encouraging given its potential. We observe that text-encoder-free STAIR shows relatively stronger performance on ImageNet classification than on text/image retrieval tasks. This is because ImageNet classes are generally more concise. In contrast, captions from COCO-5K and Flickr30K usually contain more stop words, which the MASK treats as being of equal importance to the semantically meaningful tokens. This indicates large headroom for improving text-encoder-free performance, which we leave as future work.

Text Encoder Free Localization
STAIR is able to localize the image regions given an arbitrary query (e.g. an object) from the vocabulary without any inference from the text side 3 .
Recall that the image encoder uses a vision transformer that splits the original image into grids. In Equation 4, the mapping function p(·) projects each token/grid representation into the vocabulary space, where each dimension represents the activation of the corresponding token from the vocabulary. Inspired by this, we can find the regions correlated with a query by simply looking at the activation scores of the given input tokens at each grid cell. Figure 10 shows examples of images with arbitrary text queries and their activation heatmaps. In the first example, we show the activation maps for the queries "German Shepherd Dog", "Dog", and "Cat". Both "German Shepherd Dog" and "Dog" align well with the actual dog in the image. In contrast, the activations for the query "Cat" are spread across the entire image. We also notice that the activation heatmap of "Dog" is better aligned than that of "German Shepherd Dog". This is because the multi-token query is decomposed into multiple tokens from the predefined vocabulary, and the tokens "German" and "Shepherd" usually represent concepts other than a dog. A better-tuned vocabulary, e.g. one including "German Shepherd Dog" as a single token, would help solve this problem.
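Reading the heatmap off the per-grid vocabulary activations can be sketched as below. This is an illustration of the idea, not the paper's code: it assumes access to the (grid_h, grid_w, vocab) output of the mapping function p(·) before pooling, and scores a multi-token query by the max over its sub-tokens at each cell.

```python
import numpy as np

def activation_heatmap(grid_logits, query_token_ids):
    """Localization sketch. grid_logits: (grid_h, grid_w, vocab)
    per-patch output of the mapping function p(.), before pooling.
    The heatmap for a query is the max ReLU activation over the
    query's (sub-)token dimensions at each grid cell."""
    relu = np.maximum(grid_logits, 0.0)
    return relu[..., query_token_ids].max(axis=-1)  # (grid_h, grid_w)
```

Upsampling this grid-resolution map back to the image size gives heatmaps like those in Figure 10, with no text-encoder inference involved.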
We hypothesize that STAIR models acquire the localization capability because of the max pooling operation. According to Ranasinghe et al. (2022), simply changing the image and text encoder poolers to max pooling can greatly improve the localization and segmentation capability of a CLIP model. The difference is that the CLIP model still needs to encode the text to align with the patch representations from the image, while STAIR requires no inference on the text side. These findings show that the STAIR model has great potential for localization, segmentation, and open-vocabulary detection tasks. We leave this as future work.

Related Work
Image and text retrieval There are two categories of approaches for image-text retrieval: the dual-encoder approach and the cross-encoder approach. Both encode images and text as dense embeddings, and a contrastive loss over these embeddings is used to optimize the model. Frome et al. proposed DeVISE (Frome et al., 2013) as the first dual-encoder approach for image-text retrieval. The model contains an image encoder and a text encoder that encode the image and text inputs as dense embeddings, and cosine distance is adopted to measure the similarity of image-text pairs. With the advent of transformers, Radford et al. (2021) proposed CLIP, which leverages large-scale pretraining datasets and established a new state-of-the-art across multiple benchmarks. The dual-encoder approach usually achieves faster retrieval, as the embeddings can be pre-computed and cached. Finetuning a pre-trained vision transformer (Dosovitskiy et al., 2021) and language model (Kenton & Toutanova, 2019) can further boost its performance. Contrary to the dual-encoder, the cross-encoder approach uses a single encoder to handle both image and text inputs. UNITER (Chen et al., 2020) concatenates the image and text as a single input sequence and employs a multi-layer transformer to encode them jointly. Compared to the dual-encoder, the cross-encoder is slower at inference but achieves better performance on the benchmarks, as it is capable of capturing image-text alignment at multiple granularities. STAIR follows the dual-encoder approach. However, instead of encoding the image and text as dense embeddings, it encodes them as sparse embeddings. In particular, the sparse embeddings from STAIR are easily human-interpretable and are shown to achieve better results than dense embeddings.
Document retrieval via sparse embedding Retrieving documents using sparse embeddings is a popular approach in information retrieval (Dai & Callan, 2019; Bai et al., 2020; Formal et al., 2021b;a). As the number of documents grows, the sparse embeddings can be hashed to allow fast retrieval using an inverted index system (Formal et al., 2021b;a). Our approach is largely inspired by SPLADE (Formal et al., 2021b), which encodes text queries and documents as sparse embeddings and uses an inverted index system to conduct text retrieval. Unlike SPLADE, our approach handles retrieval across modalities, which imposes unique challenges. Due to the semantic gap between image and text, designing a joint sparse embedding space for both modalities is a non-trivial task, and grounding the image and text to meaningful tokens in the joint embedding space is challenging. In STAIR, we propose a streamlined approach that enables fast retrieval, interpretability, and high retrieval accuracy.
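The inverted-index retrieval scheme this line of work relies on can be illustrated with a minimal sketch. The data layout and function names here are illustrative, not from SPLADE or STAIR: sparse embeddings are modeled as token-to-weight dictionaries, and the matching score is the sparse dot product accumulated over the query's posting lists.

```python
from collections import defaultdict

def build_inverted_index(doc_embeddings):
    """doc_embeddings: {doc_id: {token: weight}} sparse embeddings."""
    index = defaultdict(list)  # token -> [(doc_id, weight), ...]
    for doc_id, emb in doc_embeddings.items():
        for token, weight in emb.items():
            index[token].append((doc_id, weight))
    return index

def search(index, query_emb, top_k=10):
    """Score each candidate by the sparse dot product with the query."""
    scores = defaultdict(float)
    for token, q_weight in query_emb.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# Toy corpus: two "images" indexed by their sparse token embeddings.
docs = {"img1": {"dog": 1.2, "grass": 0.4},
        "img2": {"cat": 0.9, "sofa": 0.6}}
index = build_inverted_index(docs)
results = search(index, {"dog": 1.0, "cute": 0.3})
```

Only documents sharing at least one active token with the query are ever touched, which is what makes this scheme scale: the cost depends on the posting lists visited, not on the corpus size.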

Conclusion
In this paper, we proposed the sparse text and image representation approach (STAIR), which encodes image and text inputs as sparse embeddings in a sparse token space, together with a multi-stage training strategy that guarantees the embeddings ground to meaningful tokens. We compared the STAIR model with a CLIP model, and the results show that STAIR significantly outperforms it on image-text retrieval tasks and also achieves better performance on various zero-shot and linear probing classification tasks. Moreover, we conducted quantitative and qualitative analyses to demonstrate that our sparse embeddings are easier for humans to interpret than dense embeddings.

A. Single-stage training vs Multi-stage Training
We train a STAIR model without using the multi-stage training strategy described in Section 3, denoted STAIR-single-stage. The STAIR-single-stage model shares the same model architecture (shown in Figure 1) as the multi-stage STAIR model and is trained with the same training configurations as the CLIP model described in Section 4.2. The FLOPs regularization weights are set to the same values as for the STAIR model, i.e. λ1 = λ2 = 1e-3. Table 6 lists the zero-shot image-text retrieval and classification performance of the two versions of the STAIR model and the baseline CLIP model. Compared to the multi-stage trained model, STAIR-single-stage achieves similar performance on zero-shot text/image retrieval and classification tasks, and both outperform the CLIP model on most of the metrics. The interpretability, however, is much worse, as shown in Table 7. As illustrated in Section 5.2, the STAIR-single-stage model acts as a multi-modal clustering algorithm that repurposes the words as weighted token centroids. The contrastive objective trains the model to match the aligned image and text using the token centroids but is inadequate for restricting the predicted tokens to their original human-readable meanings.
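For reference, a minimal sketch of a FLOPs-style regularizer over a batch of sparse activations is shown below. This assumes the common formulation from the sparse-retrieval literature (penalizing the squared mean activation of each vocabulary token over the batch); it is a simplified NumPy illustration, not the paper's training code, which would operate on framework tensors inside the loss.

```python
import numpy as np

def flops_regularizer(activations):
    """FLOPs-style regularizer over a batch of sparse embeddings.

    activations: (N, V) non-negative token activations for N examples.
    Penalizing the squared per-token mean discourages a few tokens from
    being active for every example, pushing activation mass toward a
    sparse, more uniformly distributed pattern across the vocabulary.
    """
    mean_per_token = activations.mean(axis=0)  # (V,)
    return float((mean_per_token ** 2).sum())

# All activation mass concentrated on one token vs. spread over two:
concentrated = np.array([[1.0, 0.0],
                         [1.0, 0.0]])
spread = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
# The concentrated batch incurs the larger penalty.
```

In training, this term would be added to the contrastive loss with the weights λ1 and λ2 (one per modality) mentioned above.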

B. Internal High-Quality Dataset
The High Quality Image Text Pairs (a.k.a. HQITP-134m) dataset consists of approximately 134M diverse and high-quality images paired with descriptive captions and titles. Images range in spatial resolution from 320 to 2048 pixels on the short side. All images are in JPEG format and most are RGB. Each example image is associated with a title and a list of several captions. A small fraction (about 1%) of the examples are missing both captions and title. We favor the associated captions and find that they tokenize to an average length of 20.1 tokens, although the shortest caption is only one token and the longest is over 1000. This dataset was licensed to our industrial research lab by a third party for commercial use.

C. Our CLIP vs open-source CLIP
In Section 3, both the CLIP and STAIR models presented are trained with a batch size of 16,384. To demonstrate the correctness of our implementation, we train a CLIP-B/16 model and compare it in Table 8 with the open-source implementation, denoted CLIP-open, which uses a doubled batch size of 32,768. We also attach the STAIR model metrics for reference.
The results show that our implementation of CLIP performs competitively on ImageNet and significantly better on retrieval tasks. Furthermore, STAIR outperforms the public CLIP on retrieval tasks despite using half the batch size.

D. Token Based vs Embedding Based Search
Embedding-based retrieval systems have emerged recently (Hassantabar et al., 2021; Johnson et al., 2019). Despite the promising progress, deploying a large-scale embedding-based search system remains challenging for several reasons. First, at large scale, embedding-based systems are often implemented with approximate nearest neighbor search, which usually involves k-means and product quantization for clustering and partitioning; these operations are computationally costly at scale. Quantization is also often needed to save disk space, but at the cost of precision. Second, refreshing the index with changed data usually requires extra computation or even full re-computation, as operations like k-means and product quantization are data-dependent.
In contrast, a token-based retrieval system has no such problems because indexing is done through the tokens themselves. In addition, the tokens in STAIR are optimized with a FLOPs regularizer to encourage a uniform distribution over the sparse tokens, which is beneficial for retrieval. The interpretable token-based system also offers extra benefits: 1) building customized query trees using logical operators; 2) applying token-based blacklists/whitelists on both the query and index sides; 3) combining with other token-based features in the inverted index system.
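The first two benefits above can be sketched directly on posting lists. This is an illustrative sketch with hypothetical names, assuming the same token-to-posting-list index layout a standard inverted index uses; it is not part of the STAIR system itself.

```python
def boolean_and(index, required_tokens):
    """Query-tree AND node: documents must contain every required token."""
    postings = [set(doc for doc, _ in index.get(t, []))
                for t in required_tokens]
    return set.intersection(*postings) if postings else set()

def apply_blacklist(query_emb, blacklist):
    """Drop blacklisted tokens from a sparse query embedding
    before it ever reaches the index."""
    return {t: w for t, w in query_emb.items() if t not in blacklist}

# Toy index: token -> [(doc_id, weight), ...]
index = {"dog":  [("img1", 1.2), ("img3", 0.7)],
         "park": [("img1", 0.5)]}
hits = boolean_and(index, ["dog", "park"])
clean = apply_blacklist({"dog": 1.0, "banned": 0.4}, {"banned"})
```

Because filtering operates on human-readable tokens rather than opaque embedding coordinates, such rules can be audited and adjusted without retraining the model.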
We also recognize a downside of the current work: STAIR is not able to capture high-level semantics. This can potentially be addressed by combining dense embeddings with semantic tokens, and we leave it as future work.

E. Image Prediction Weights
We list the detailed weights of predicted tokens for each image from Figure 4 for reference.
An airport filled with planes sitting on tarmacs.

Table 9. Qualitative analysis of the sparse embedding.