Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties to serve as a drop-in replacement for text.


Introduction
Although there are over 7,000 languages spoken worldwide (Lewis et al., 2016), only several dozen have enough data available to support supervised speech recognition, and many languages do not even employ a writing system (Adda et al., 2016). In contrast, most people learn to use spoken language long before they learn to read and write, suggesting that linguistic annotation is not a prerequisite for speech processing systems. This line of reasoning motivates research that aims to discover meaningful linguistic abstractions (phones, words, etc.) directly from the speech signal, with the intention that they could reduce the reliance of spoken language systems on text transcripts.
A rich body of work has recently emerged investigating representation learning for speech using visual grounding objectives (Synnaeve et al., 2014;Harwath and Glass, 2015;Harwath et al., 2016;Kamper et al., 2017;Havard et al., 2019a;Merkx et al., 2019;Scharenborg et al., 2018;Hsu and Glass, 2018a;Kamper et al., 2018;Surís et al., 2019;Ilharco et al., 2019;Eloff et al., 2019), as well as how word-like and subword-like linguistic units can be made to emerge within these models (Harwath and Glass, 2017;Harwath et al., 2019;Drexler and Glass, 2017;Harwath et al., 2019;Harwath and Glass, 2019;Havard et al., 2019b;Harwath et al., 2020). So far, these efforts have predominantly focused on inference, where the goal is to learn a mapping from speech waveforms to a semantic embedding space. Generation of speech conditioned on a point in a semantic space has been less explored, and is what we focus on in this work. We hypothesize that generative approaches offer interesting advantages over relying solely on inference. For example, prior works have demonstrated the capability of recognizing visually descriptive words, but have not been shown to learn non-visual words or grammar. Our experiments show that these aspects of spoken language are learned to some degree by a visually-grounded generative model of speech.
Specifically, we introduce a model capable of directly generating fluent spoken audio captions of images without the need for natural language text, either as an intermediate representation or a form of supervision during training (Figure 1).

Figure 1: Spoken image captions generated from the proposed model, with diversity in both linguistic content and acoustic properties, controlled through the I2U and the U2S models, respectively. Example transcriptions: "a person in a blue jacket is on a snowboard on a snow covered slope"; "a snowboarder is snowboarding on the side of the mountain" (same unit sequence, different speakers; different unit sequences, same speaker). Transcriptions are provided only for illustration. Audio samples are available at https://wnhsu.github.io/image-to-speech-demo.

Tremendous progress has been made recently in natural language image caption generation (Kiros et al., 2014;Mao et al., 2015;Vinyals et al., 2015;Karpathy and Fei-Fei, 2015;Rennie et al., 2017;Dai and Lin, 2017;Lu et al., 2017;Anderson et al., 2018;Lu et al., 2018) and naturalistic text-to-speech synthesis (TTS) (Ping et al., 2017;Taigman et al., 2017;Wang et al., 2017;Shen et al., 2018;Oord et al., 2016). Combining these models provides a means for generating spoken image descriptions, but existing approaches for training such models rely on text. Instead, we leverage sub-word speech units discovered using a self-supervised learning objective as a drop-in replacement for the text. We hypothesize that by using such techniques, an even wider variety of traditionally text-based NLP models could be applied to speech data without the need for transcription or automatic speech recognition (ASR) systems. Because all human languages utilize small, discrete phonetic inventories (International Phonetic Association, 1999), we posit that our framework should be applicable to any language in the world. In our experiments, we demonstrate that not just any set of discovered speech units can function in this role. We find the greatest success with units that are discrete, exhibit a low frame rate, and are highly robust to speaker and environmental variability. The main contributions of our paper are as follows: 1. The first methodology for fluent image-to-speech synthesis that does not rely on text. A critical aspect of our approach is factorizing the model into an Image-to-Unit (I2U) module and a Unit-to-Speech (U2S) module, where the speech units are discovered in a self-supervised fashion. This approach enables disentanglement of linguistic variability from acoustic/speaker variability.
2. Extensive analysis of the properties required for learned units to replace text. While the idea may seem simple and straightforward, obtaining proper units is not a trivial task. In fact, most of the units we experiment with in this paper fail to serve as drop-in replacements. Moreover, we demonstrate that what constitutes good units varies significantly between inference and generation.
3. Demonstrating the insufficiency of beam search-based evaluation. We show that even when an I2U model fails to generate sensible captions through beam search decoding, it can still produce reasonable captions by sampling from the posterior, hinting that posterior mode-based evaluation can only inspect limited aspects of a model. 4. Proposing a semantic diversity-aware metric. We identify issues with an existing metric (Vijayakumar et al., 2018) and propose M-SPICE for sampling-based evaluation to address the problems.
5. Over 600,000 spoken audio captions for the MSCOCO dataset. We collect 742 hours of speech from 2,352 people tasked with reading each caption out loud. This dataset will be made publicly available to support work at the intersection of speech, language, and vision.

Related Work
Image-to-Text and Image-to-Speech Captioning. Significant progress towards generating realistic (text) captions that describe the content of visual images was made with the advent of deep neural networks (Vinyals et al., 2015;Karpathy and Fei-Fei, 2015;Anderson et al., 2018). Far less work has focused on generating spoken audio captions from natural images. Training an image-to-speech system using separate (image, text) and (text, speech) datasets was explored in (Ma et al., 2019). Hasegawa-Johnson et al. (2017) is the only prior work that has explored image-to-speech synthesis without using text, but with limited results. In that work, BLEU scores were only computed in terms of unsupervised acoustic units, not an estimate of the actual words produced by the synthesizer, which can be problematic as discussed in Section 4. The resulting captions were not evaluated for fluency, naturalness, or intelligibility, and the BLEU scores in terms of the unsupervised units were very low (0.014 on the MSCOCO test set) compared to ours (0.274). Wang et al. (2020b) is a concurrent work that proposes a text-free end-to-end image-to-speech model, which simplifies the task by using pairs of images and synthesized speech generated from a single-speaker TTS model to reduce the acoustic variation. In contrast, by leveraging robust learned units, our I2U module can be trained on real speech with abundant variation, and the U2S module serves as a vocoder that requires only a small amount of clean speech (transcripts not needed). Hence, our system imposes fewer data constraints yet still outperforms Wang et al. (2020b).
Voice Conversion without Text aims to convert the speaker identity in a recording while preserving the textual content (Abe et al., 1990;Stylianou et al., 1998;Toda et al., 2007). It has recently seen progress using neural approaches (Hsu et al., 2016, 2017a;Fang et al., 2018;Chorowski et al., 2018;Chou et al., 2018;Serrà et al., 2019), but the most relevant work to our own is the ZeroSpeech 2019 challenge (Dunbar et al., 2019;Tjandra et al., 2019), which addresses unsupervised learning of discrete speech units that can replace text and be used as input to TTS models. Unlike image-to-speech synthesis, these tasks only infer phonetic units from given audio recordings rather than generating them.
Speech Pre-Training and Its Applications. Interest in this area has recently surged. Various learning objectives have been proposed, including autoencoding with structured latent spaces (van den Oord et al., 2017;Eloff et al., 2019;Chorowski et al., 2019;Hsu et al., 2017b;Hsu and Glass, 2018b;Khurana et al., 2019), predictive coding (Chung et al., 2019;Wang et al., 2020a), contrastive learning (Oord et al., 2018;Schneider et al., 2019), and more. Prior work addresses inferring linguistic content such as phones from the learned representations (Baevski et al., 2020;Kharitonov et al., 2020;Hsu et al., 2021). In contrast, this work focuses on generating the learned representation from a different modality, which evaluates representations from a different perspective.

Framework Overview
A depiction of our modeling approach is shown in Figure 2. Caption generation for an image involves a cascade of two components: given an input image I, we first generate a linguistic unit sequence U according to the I2U module P (U | I). Given the linguistic symbol sequence U , we generate a speech waveform S according to the U2S module P (S | U ). If the linguistic unit sequence U were to take the form of natural language text, the model would be equivalent to the cascade of a conventional image captioning system followed by a TTS module. Note that we assume S ⊥ I | U because prosody variation is not dependent on the image for the datasets considered.
The key idea in this paper is to instead define U to be a sequence of learned speech units that are as robust and compact as text, but discovered without text supervision. We define inference with this S2U model as U = f (S), enabling us to "transcribe" any given speech audio waveform S into a sequence of units U . The addition of this third component enables us to train P (U | I) from a dataset of images paired with spoken captions {(I_1, S_1), . . . , (I_N, S_N)}. The conditional independence assumption between S and I given U enables us to choose any arbitrary speech dataset for training P (S | U ), therefore enabling the speaker characteristics and other acoustic properties to be controlled independently from the I2U system (Hsu et al., 2019;Henter et al., 2018;Akuzawa et al., 2018). Table 1 summarizes the five datasets used for training the S2U, I2U, and U2S models. Note that we deliberately choose different datasets for training each module, which aims to examine the robustness of the units when transferring across domains, including shifts in speaker demography, speaking style (scripted/spontaneous), and linguistic content (book/newspaper/image description). Among the three datasets with image and speech pairs (Places, Flickr8k, and MSCOCO), we chose the latter two for training I2U models, because they include five captions per image, which is more suitable for caption metrics such as SPICE (Anderson et al., 2016); moreover, they are commonly used image captioning datasets with many text-based baselines in the literature. Places only contains one spoken caption per image and has not been used for captioning.
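The cascade and the role of the S2U model f can be sketched as follows; the model callables (`s2u`, `i2u`, `u2s`) are hypothetical placeholders standing in for the trained modules, not the actual implementations.

```python
def make_i2u_training_pairs(image_speech_pairs, s2u):
    # "Transcribe" each waveform S into units U = f(S) with the S2U model,
    # yielding (image, unit sequence) pairs for training P(U | I).
    return [(img, s2u(wav)) for img, wav in image_speech_pairs]

def caption_image(image, i2u, u2s):
    # Two-stage cascade: decode a unit sequence from P(U | I), then
    # synthesize a waveform from P(S | U).
    units = i2u(image)
    return u2s(units)
```

Because S is conditionally independent of I given U, `u2s` can be trained on a completely separate speech corpus from the one used for `i2u`.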

Datasets
Specifically, as part of this work we collect SpokenCOCO, a spoken version of the MSCOCO captioning dataset (Lin et al., 2014).

Figure 2: Diagram of our proposed framework, including the Image-to-Unit model (Show, Attend, and Tell). The ResDAVEnet-VQ model was trained using a {2} → {2, 3} curriculum (in the notation given in Harwath et al. (2020)).

Since visual semantics are described with words, which in turn are composed of phones, the representations learned by ResDAVEnet-VQ are forced to be predictive of words and phones rather than speaker, noise, etc. (Tschannen et al., 2020).
In contrast, many speech representations are trained by reconstructing (Chorowski et al., 2019;Hsu et al., 2017b) or predicting unseen speech signals (Chung et al., 2019), which would inevitably capture factors unrelated to the linguistic content. To demonstrate the advantage of representation learning with grounding, we compare ResDAVEnet-VQ with a reconstruction-based model, WaveNet-VQ, trained on the PlacesAudio dataset. We denote the units extracted from this model with WVQ. We use the implementation of Harwath et al. (2020) for ResDAVEnet-VQ, and, for WaveNet-VQ, the implementation that achieved the best performance in the ZeroSpeech 2019 challenge.

Unit Selection and Run Length Encoding
Although the ResDAVEnet-VQ model has been shown to be capable of learning both phone-like and word-like units, the experiments in (Harwath et al., 2020) show that only several hundred words are explicitly learned, which tend to be "visual words." Conversely, the phone-like units learned by the lower VQ layers of the model were shown to cover all of the phones in American English (of which there are only several dozen). For this reason, we choose the phone-like units learned by the lower VQ layers to represent U .
Nominally, the VQ layers will output one-hot vectors at a uniform temporal rate, downsampled with respect to the framerate of the acoustic input depending upon which VQ layer is used. Given an input computed with a 10ms frame shift, the two VQ layers investigated in this paper (VQ2 and VQ3) respectively output vectors every 20ms and 40ms. In general, the VQ units are repeated for several consecutive frames. We can decrease the average length of the symbol sequence U by employing a lossy form of run-length encoding (RLE) (see Figure 2) which retains the sequence of symbol identities but discards duration information. Each unit then represents a variable-length segment. This removes the burden of unit duration modeling from the I2U model and shifts it onto the U2S model, which we will show to be crucial.
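The lossy run-length encoding described above can be sketched as follows; only the sequence of unit identities survives, while durations are left for the U2S model to reconstruct.

```python
from itertools import groupby

def run_length_encode(units):
    # Lossy RLE: collapse consecutive repeats of the same VQ unit,
    # keeping identities and discarding duration information.
    return [unit for unit, _group in groupby(units)]
```

For example, a VQ3 frame sequence such as `[32, 32, 32, 5, 5, 7, 7, 7, 32]` collapses to `[32, 5, 7, 32]`, shifting the burden of duration modeling from the I2U model onto the U2S model.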

Image-to-Unit and Unit-to-Speech
Both the I2U model and the U2S model are based upon recurrent seq2seq with attention networks (Bahdanau et al., 2015). Specifically, we adopt Show, Attend, and Tell (SAT) for the I2U model. It has an image encoder pre-trained for classification, which is language agnostic and hence should work for any language within our proposed framework. The decoder, on the other hand, is randomly initialized. We train the SAT model in two stages, where the encoder parameters are only updated in the second stage. We distinguish the models from the two stages as SAT and SAT-FT (fine-tuned), respectively, when presenting the results. For the U2S model, we adopt Tacotron2 (Shen et al., 2018) and WaveGlow (Prenger et al., 2019) for unit-to-spectrogram and spectrogram-to-waveform generation, respectively. In particular, a pre-trained WaveGlow is used without fine-tuning.
The I2U model is trained on (I, f (S)) pairs, which requires pairs of images and speech, while the U2S model is trained on (f (S), S) pairs, which can be obtained from an arbitrary set of speech. Both models are trained with the maximum likelihood objective (E_{I,U}[log P (U | I)] for I2U and E_{S,U}[log P (S | U )] for U2S).

Experiments
We design experiments to address three questions: First, how can we measure the performance of an image-to-speech system? Our system can fail to produce a good caption if the I2U model fails to encode linguistic/semantic information into the unit sequence, or if the U2S model fails to synthesize an intelligible waveform given a unit sequence. To better localize these failure modes, we evaluate the full I2S system as well as the U2S system in isolation. We evaluate the U2S system by using it as a vocoder to synthesize unit sequences inferred from real speech and soliciting human judgements in the form of Mean Opinion Score (MOS) and Side-By-Side (SXS) preference tests (Table 2).
To evaluate the I2S system, we can use any method that measures the semantic information contained in the generated speech. We consider two sets of end-to-end metrics, word-based and retrieval-based, and one set of proxy unit-based metrics. Word-based metrics transcribe a generated spoken caption into text (manually or with an ASR system) and then measure word-based captioning metrics against a set of reference captions, such as BLEU-4 (Papineni et al., 2002) (adjusted n-gram precision), METEOR (Denkowski and Lavie, 2014) (unigram F-score considering word-to-word alignment), ROUGE (Lin, 2004) (n-gram recall), CIDEr (Vedantam et al., 2015) (TF-IDF weighted n-gram cosine similarity), and SPICE (Anderson et al., 2016) (F-score of semantic propositions in scene graphs). This enables comparison between image-to-speech systems with a text "upper bound", but is not applicable to unwritten languages.
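As a concrete illustration of the word-based metrics, a simplified sentence-level BLEU-4 can be sketched as below. This is a geometric mean of modified n-gram precisions with no brevity penalty and a single reference, for illustration only, and is not the implementation used in our evaluation.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    # All contiguous n-grams of a token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(reference, hypothesis):
    # Geometric mean of modified 1- to 4-gram precisions.
    precisions = []
    for n in range(1, 5):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum((hyp_counts & ref_counts).values())  # clipped counts
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    return exp(sum(log(p) for p in precisions) / 4)
```

In the full pipeline, `hypothesis` would be the ASR (or human) transcription of a generated spoken caption, and `reference` one of the five ground-truth captions.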
Retrieval-based metrics include image-to-speech and speech-to-image retrieval (Harwath et al., 2020), which require a separately trained cross-modal retrieval model for evaluation. Such metrics are text-free, but they cannot measure other aspects of language generation such as syntactic correctness (partially captured by BLEU-4) or the scope of the learned vocabulary. Lastly, unit-based metrics are similar to text-based ones, but replace words with units when computing n-gram statistics. However, systems built on different units are not directly comparable, and unit-based scores can be inflated if duration is modeled using unit repetition.
Second, what properties must learned units have to be a drop-in replacement for text? The most essential differences between text and speech are the amount of information encoded and the sequence lengths. Beyond text, speech also encodes prosody, speaker, and environment information, as well as the duration of each phone, all of which are minimally correlated with the conditioned images. We hypothesize that learned speech units should discard such information in order to seamlessly connect the I2U and U2S modules. To verify this, we pay particular attention to variations of the learned units in frame rate (VQ2/VQ3), encoding of duration information (RLE or not), and robustness to domain shift (WVQ/VQ3). Units are run-length encoded by default. Table 2a shows the properties of the units before run-length encoding.
Third, how should language generation models be evaluated more generally? We examine evaluation of the I2S model using beam searchbased decoding as well as sampling-based decoding. We find that because evaluation metrics that are reliant on beam search-based decoding only evaluate the mode of a model's posterior, they do not reflect the ability of a model to generate diverse linguistic content. Furthermore, we show that it is possible for a model's posterior mode to be linguistically meaningless, and yet meaningful language can still be generated with sampling-based decoding. Towards this end, we introduce a novel multihypothesis evaluation metric (M-SPICE), which uses sampling-based decoding (instead of beam search) to generate a set of captions. We can then compute the overall coverage of this caption set against a reference; see Section 4.4 for details.

Evaluating the U2S Model
We construct a Tacotron-2 model for each of the three unit types on the LJSpeech audio data by transcribing each LJSpeech utterance into a unit sequence, then training the U2S model on the RLE-ed unit sequence and spectrogram pairs. We evaluate the naturalness of the speech produced by each model on held-out data, both in-domain using LJSpeech and out-of-domain (OOD) using SpokenCOCO. Amazon Mechanical Turk (AMT) workers performed side-by-side preference tests (SXS) and naturalness evaluation based on mean opinion scores (MOS) on a scale from 1 to 5 for each U2S model, which we display in Table 2. Although VQ2 was preferred for in-domain synthesis on LJSpeech, VQ3 achieved the highest scores and the least degradation (-0.387) on the out-of-domain SpokenCOCO, indicating that of the three units VQ3 is the most robust to domain shift.

Incorporating the I2U Model
We trained an SAT model on SpokenCOCO for each of the three RLE-ed units, as well as VQ3 units without RLE. We also compare to text characters and words; the full hyperparameter and training details for all models are provided in Section B in the appendix, but in general we kept these as constant as possible when comparing different linguistic representations.
Before connecting the U2S model, we noticed that all RLE speech unit models except the one based on VQ3 units failed to generate coherent captions with beam search decoding; examples are shown in Table 3. We hypothesize that the reason the VQ2 and WVQ units failed is their lack of invariance to domain shift, as evidenced by their decay in naturalness when used for OOD synthesis as shown in Table 2. This may cause the entropy of the unit distribution conditioned on an image to be higher, as each phoneme may be represented by multiple units, and therefore the I2U model suffers from the same looping issues observed in unconditional language models of text.

To evaluate the full Image-to-Speech model, we first train an ASR system on the re-synthesized SpokenCOCO captions using the VQ3 Tacotron-2 model. This enables us to estimate a word-level transcription of the spoken captions produced by our system. In order to verify that the synthesized captions are intelligible to humans and that the ASR system did not simply learn to recognize artifacts of the synthesized speech, we asked AMT workers to transcribe into words a set of 500 captions generated by our I2U→U2S system and also evaluated their naturalness. Three workers transcribed and three workers rated each caption, allowing us to compute an MOS score (3.615±0.038), a word error rate (WER) between the three human transcriptions (9.40%), and an average WER between the human and ASR-produced transcriptions (13.97%). This confirms that our system produces reasonably natural speech and that the ASR system is sufficiently accurate for transcribing synthesized speech.

Table 4 summarizes our results on MSCOCO and Flickr8k using beam search. We compare with the literature for bottom-up text captioning (rows 1-2) and text-free end-to-end image-to-speech synthesis (row 3). We train the decoder of an SAT model while keeping the image encoder fixed (rows 4-6), in addition to fine-tuning the encoder (rows 7-9).
Despite having no access to text, the SAT-FT speech captioning model trained on VQ3 units achieves a BLEU-4 score of .233 with beam search decoding on MSCOCO. This is very close to the .243 achieved by the original SAT word-based captioning model. Figure 1 shows that the generated captions are fluent and reflect the implicit learning of some syntactic rules. It is evident that the proposed model is capable of generating fluent and meaningful image captions.
Results comparing four unit representations on all three sets of metrics are shown in Table 5. Comparing word-based and unit-based evaluations, we note that the relative ranking among VQ3, VQ2, and WVQ is consistent across BLEU-4, METEOR, and ROUGE for SAT models. However, VQ3 \ RLE achieves abnormally high scores on these metrics despite producing trivial captions for all images, as shown in Table 3. This is because unit "32" has learned to represent non-speech frames such as silence, which frequently occur at both the beginning and end of utterances. Without RLE, consecutive strings of "32" units are extremely common in both the candidate and reference captions, which inflates the scores of this model. The exception here is the CIDEr metric, whose TF-IDF weighting tends to de-emphasize these kinds of uninformative patterns. Nonetheless, when comparing SAT and SAT-FT with VQ3 units, CIDEr does not rank them the same as the word-based metrics.
Regarding retrieval-based evaluation, despite the fact that the ResDAVEnet model was only trained on the original, human-spoken captions for the MSCOCO images, it works very well for the fully synthetic captions. The speech and image retrieval scores for 1k human-spoken validation captions are 0.867 and 0.828 R@10, respectively, while the SAT-FT VQ3 model achieves 0.766 and 0.765 R@10. This indicates that this image-to-speech model is able to infer the salient semantic content of an input image, generate a unit sequence that captures that content, and generate speech that is sufficiently natural sounding for the ResDAVEnet model to recover that semantic information. Several of the other image-to-speech models also achieve respectable retrieval performance, and the overall ranking of the models mirrors that which we found when using word-based evaluation metrics.

From Mode to Distribution: Evaluating Captions Generated via Sampling
The results in the previous section only evaluate beam search decoding with the I2U model, and do not fully reveal the posterior over captions for an input image, or whether the unit representations that failed with beam search would work well with other decoding methods. To probe this, we evaluate the models using sampling-based caption generation. Figure 3 shows the SPICE scores on SpokenCOCO using beam search and two sampling-based methods. VQ3 still performs the best of all unit types with both beam search and sampled decoding. VQ2 can sometimes generate captions with beam search when the beam is kept small, but as the beam grows it begins to loop and the scores become very low. We see that all unit types can generate reasonable captions when decoding via sampling. Moreover, we discovered that 1) ResDAVEnet-VQ units consistently outperform the WaveNet-VQ units, suggesting that they better capture sub-word structure, and 2) VQ3 \ RLE achieves better scores than VQ2 when using a larger temperature or a larger k for top-k sampling. We estimated the vocabulary size of the SAT-FT model with VQ3 units by counting the number of unique recognized words produced at least 3 times when captioning the SpokenCOCO test images. These numbers are shown for the model under the various decoding methods in Figure 4. The number of captions per image is denoted by n, where the top candidates are used for beam search and i.i.d. samples are drawn for sampling. Sampling-based decoding reveals a larger vocabulary size than beam search, and the number of words learned by our models (≥ 2^12) is far greater than the several hundred words learned by the ResDAVEnet-VQ retrieval model (Harwath et al., 2020). We hypothesize that training a model to generate spoken captions encourages it to learn many more words than only being trained to retrieve images from captions.
We also hypothesize that because beam search attempts to find the mode of the posterior over captions, it tends to produce a smaller set of words and does not reveal the breadth of the model distribution.
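The vocabulary-size estimate above can be sketched as follows; the caption strings stand in for ASR transcriptions of generated speech, and the threshold of 3 occurrences follows the procedure described above.

```python
from collections import Counter

def vocab_size(transcribed_captions, min_count=3):
    # Count unique words appearing at least `min_count` times across
    # all ASR-transcribed generated captions.
    counts = Counter(word
                     for caption in transcribed_captions
                     for word in caption.split())
    return sum(1 for c in counts.values() if c >= min_count)
```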

New Diversity-Aware Metric: M-SPICE
The previous section showed that even when the SPICE scores were comparable, sampling-based decoding revealed a much larger model vocabulary than beam search, especially when multiple captions are generated for each image. This highlights a limitation of SPICE in measuring diversity. Formally speaking, SPICE computes an F-score between two bags of semantic propositions T(S) and T(c), parsed from a set of references S = {s_i}_i and a hypothesis c, where T(c) denotes the bag of propositions extracted from a scene graph parsed from c; for multiple sentences, T(S) = ∪_i T(s_i).
To extend SPICE for scoring multiple hypotheses C = {c_j}_{j=1}^{J}, one can compute an average SPICE, (1/J) Σ_j F1(T(S), T(c_j)), but averaging does not reward semantic diversity across hypotheses. To address the deficiencies of the existing metrics, we propose a new metric named multi-candidate SPICE (M-SPICE), which takes the union of the candidate propositions and computes the F-score against the reference propositions: F1(T(S), ∪_j T(c_j)). M-SPICE assigns a higher score if the candidate set captures diverse and correct propositions: a set of hypotheses covering distinct correct propositions scores higher than one repeating the same propositions, as desired. Figure 5 shows the M-SPICE scores of our SAT-FT model using VQ3 units on SpokenCOCO. When evaluating over multiple captions (n > 1), using the beam search hypotheses increases the score less than sampling.
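A minimal sketch of M-SPICE alongside average SPICE is given below, treating propositions as sets rather than bags for simplicity; the real metric parses scene graphs, and the proposition strings here are illustrative only.

```python
def f1_score(reference, candidate):
    # F-score between two sets of propositions.
    match = len(reference & candidate)
    if match == 0:
        return 0.0
    precision = match / len(candidate)
    recall = match / len(reference)
    return 2 * precision * recall / (precision + recall)

def m_spice(ref_props, cand_prop_sets):
    # M-SPICE: F1(T(S), union_j T(c_j)) over the union of all
    # candidate propositions.
    union = set().union(*cand_prop_sets)
    return f1_score(set(ref_props), union)

def avg_spice(ref_props, cand_prop_sets):
    # Average SPICE: mean of per-candidate F-scores, for comparison.
    scores = [f1_score(set(ref_props), set(c)) for c in cand_prop_sets]
    return sum(scores) / len(scores)
```

Under this sketch, a candidate set covering several distinct reference propositions scores higher on `m_spice` than one repeating a single correct proposition, even when their `avg_spice` values are similar.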

Disentangled Voice Control for Image-to-Speech Synthesis

We examine to what extent the VQ3 units are portable across different speakers by training a U2S model on the VCTK dataset that additionally takes a speaker ID as input. The resulting model is able to generate speech in the voice of any VCTK speaker. We evaluate the captions produced by this system on SpokenCOCO for 5 speakers in Table 6.
To compute these scores we transcribe the captions generated by each model into text using the ASR system we describe in Section 4.2, which was solely trained on re-synthesized SpokenCOCO captions using the LJSpeech U2S model. The scores in Table 6 indicate not only that the I2U model can be easily integrated with U2S models representing a diverse set of speakers, but also that the LJSpeech ASR system works very well on the speech synthesized from the VCTK models.

Conclusion
In this paper, we presented the first model capable of generating fluent spoken captions of images without relying on text, which almost matches the performance of early text-based image captioning models. Our comprehensive experiments demonstrated that, to serve as a drop-in replacement for text, learned units need to be robust, have a low frame rate, and encode little or no duration information. We also identified the caveats of mode-based evaluation and proposed a new metric to address semantic diversity. As part of this work, a novel dataset of over 600k spoken captions for the MSCOCO dataset is introduced, which we will make publicly available to the research community. Future work should investigate applying the proposed method to additional languages, devising improved speech unit representations, and jointly training the speech unit model with the I2S model. This would offer the opportunity to explore new analysis-by-synthesis training objectives.

A Visually-Grounded Speech Datasets

Table A1 displays details of the three visually-grounded speech datasets used in this paper. When computing duration statistics, we exclude utterances longer than 15s for SpokenCOCO and Flickr8k Audio, and 40s for Places Audio, because we found that those utterances resulted from incorrect operation of the data collection interface (e.g., workers forgot to stop recording). When computing vocabulary sizes and word statistics, text transcripts are normalized by lower-casing all letters and removing characters that are neither letters nor digits.
For the SpokenCOCO data collection on Amazon Mechanical Turk, we displayed the text of an MSCOCO caption to a user and asked them to record themselves reading the caption out loud. For quality control, we ran a speech recognition system in the background and estimated the word-level transcription of each recording. We computed the word error rate of the ASR output against the text that the user was prompted to read, and only accepted the caption if the word error rate was under 30%. In the case that the word error rate was higher, the user was asked to re-record their speech. We paid the users $0.015 per caption recorded, which in conjunction with the 20% overhead charged by Amazon resulted in a total collection cost of $10,898.91.

Table A1: Statistics and properties of the three visually-grounded speech datasets used in the paper.
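The WER-based quality-control check can be sketched as follows, using a standard Levenshtein distance over word tokens; the actual ASR system and collection interface are not shown, and `accept_recording` is a hypothetical helper illustrating the 30% threshold.

```python
def wer(reference, hypothesis):
    # Word error rate: Levenshtein distance over word tokens,
    # normalized by the reference length.
    r, h = reference.split(), hypothesis.split()
    d = list(range(len(h) + 1))  # DP row for the empty reference prefix
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # prev holds the diagonal entry D[i-1][j-1]
            prev, d[j] = d[j], min(d[j] + 1,           # deletion
                                   d[j - 1] + 1,       # insertion
                                   prev + (rw != hw))  # substitution
    return d[len(h)] / max(len(r), 1)

def accept_recording(prompt, asr_hypothesis, threshold=0.3):
    # Quality control: accept only if ASR WER against the prompt is under 30%.
    return wer(prompt, asr_hypothesis) < threshold
```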

B Detailed Experimental Setups
In this section, we provide details about data preprocessing, model architecture, and training hyperparameters for each module used in this paper. The same setups are used for all unit types unless otherwise stated.
B.1 Image-to-Unit Model

Data Images are reshaped to 256×256×3 matrices and per-channel normalized with µ = [0.485, 0.456, 0.406] and σ = [0.229, 0.224, 0.225]. During training, unit sequences are truncated or padded to the target lengths shown in Table A2, which are chosen such that fewer than 10% of sequences are truncated while still allowing a reasonable batch size. Units that occur fewer than five times are excluded. Sequences are not truncated during evaluation. We follow the data splits used in (Harwath et al., 2020) for Places, and (Karpathy and Fei-Fei, 2015) for Flickr8k and SpokenCOCO (the "Karpathy split").
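The image normalization and sequence padding described above can be sketched as below; the pad symbol and function names are illustrative assumptions, and resizing to 256×256 is taken as given.

```python
import numpy as np

# ImageNet channel statistics quoted in the setup description.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def preprocess_image(image: np.ndarray) -> np.ndarray:
    """Per-channel normalization of a 256x256x3 image with values in
    [0, 1]. This sketch assumes the input has already been resized."""
    assert image.shape == (256, 256, 3)
    return (image - MEAN) / STD

def pad_or_truncate(units: list, target_len: int, pad_id: int = 0) -> list:
    """Truncate or right-pad a unit sequence to the training target
    length (pad_id is an assumed placeholder symbol)."""
    units = units[:target_len]
    return units + [pad_id] * (target_len - len(units))
```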

Model
We adopt an open-source reimplementation of Show, Attend, and Tell (SAT) with soft attention, which replaces the original CNN encoder with a ResNet-101 pre-trained on ImageNet for image classification. The last two layers of the ResNet (a pooling layer and a fully-connected layer) are removed, such that the encoder produces a 14×14×2048 feature map for each image.
Training Adam (Kingma and Ba, 2015) with a learning rate of 10−4 is used for optimizing both stages (SAT and SAT-FT). The training objective is maximum likelihood combined with the doubly stochastic attention regularization introduced in SAT, with a weight of 1. Dropout is applied to the input of the decoder softmax layer with a probability of 0.5 during training. Gradients are clipped at 5 for each dimension. The first stage is trained for at most 30 epochs, and its best checkpoint is used to initialize the second stage.

Figure A1: Utterance duration histograms for the three visually-grounded speech datasets.

Figure A2: M-SPICE F-score (same as Figure 5) for top-k sampling with k ∈ {3, 5, 10} at temperatures t ∈ {0.1, 0.4, 0.7, 1.0}.
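The doubly stochastic attention regularizer mentioned in the training objective above encourages the attention received by each spatial location, summed over decoding steps, to be close to 1. A minimal numpy sketch (not the training implementation) is:

```python
import numpy as np

def doubly_stochastic_penalty(alphas: np.ndarray, weight: float = 1.0) -> float:
    """Doubly stochastic attention regularizer from Show, Attend, and
    Tell. `alphas` has shape (T, L): T decoder steps over L image
    locations (14*14 here). Weight 1 matches the setup above."""
    per_location = alphas.sum(axis=0)  # total attention per location
    return weight * float(((1.0 - per_location) ** 2).sum())
```

This term is added to the negative log-likelihood; it penalizes the decoder for ignoring image regions entirely.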
Training A batch size of 64 is used for all systems. Adam (Kingma and Ba, 2015) with an initial learning rate of 10−3 is used to minimize the sum of the mean squared error from spectrogram prediction and the binary cross-entropy from stop-token prediction. L2 regularization of the parameters with a weight of 10−6 is applied, and the L2 norm of the gradients is clipped at 1. Models are trained for 500 epochs on LJSpeech and 250 epochs on VCTK, and selected based on the validation loss. Empirically, each training epoch on LJSpeech takes about 12 minutes using two NVIDIA Titan X Pascal GPUs for both the VQ2 and VQ3 models.

Table A3 presents the word-based evaluation results of decoding via sampling for all five metrics, supplementing Figure 3 in the main paper, which presents only the SPICE results. We see that the rankings between unit types are generally consistent across all metrics, except for the ranking between WVQ and VQ3 \ RLE when sampling with a temperature of 0.4. This is a relatively low-score regime in which both models are transitioning from generating trivial captions (t = 0.1) to non-trivial captions (t = 0.7).
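The combined spectrogram/stop-token objective described in the Training paragraph above can be sketched as follows; array names are illustrative, and the real implementation operates on batched framewise predictions.

```python
import numpy as np

def unit_to_speech_loss(pred_spec, true_spec, pred_stop_logit, true_stop):
    """Sum of MSE on spectrogram frames and binary cross-entropy on
    the stop token (illustrative sketch of the stated objective)."""
    mse = float(np.mean((pred_spec - true_spec) ** 2))
    # Sigmoid then numerically guarded binary cross-entropy.
    p = 1.0 / (1.0 + np.exp(-np.asarray(pred_stop_logit, dtype=float)))
    eps = 1e-12
    bce = float(np.mean(-(true_stop * np.log(p + eps)
                          + (1 - true_stop) * np.log(1 - p + eps))))
    return mse + bce
```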

D Full Results of Learned Vocabulary Size
In Table A4, we display the numerical results depicted graphically in Figure 4.
E More Image-to-Speech Samples

Table A5 shows captions sampled from the VQ3 model trained on MSCOCO. We note that the sampled captions exhibit diversity in both their content and linguistic style. We observe that the captioning model has learned to produce captions that correctly use quantifiers and conjugate verbs ("a couple of cows walking" vs. "a cow is standing"). The model also disentangles object identity from attributes such as color ("red fire hydrant" vs. "yellow fire hydrant" vs. "green fire hydrant").