Character-Aware Models Improve Visual Text Rendering

Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a key contributing factor: popular text-to-image models lack character-level input features, making it much harder to predict a word’s visual makeup as a series of glyphs. To quantify this effect, we conduct a series of experiments comparing character-aware vs. character-blind text encoders. In the text-only domain, we find that character-aware models provide large gains on a novel spelling task (WikiSpell). Applying our learnings to the visual domain, we train a suite of image generation models, and show that character-aware variants outperform their character-blind counterparts across a range of novel text rendering tasks (our DrawText benchmark). Our models set a much higher state-of-the-art on visual spelling, with 30+ point accuracy gains over competitors on rare words, despite training on far fewer examples.


Introduction
Over the last year, image generation models have made impressive quality gains (Rombach et al., 2021; Ramesh et al., 2022; Saharia et al., 2022; Yu et al., 2022), and are increasingly visible in the public sphere. While many practical use cases are already within reach, rendering reliable visual text in images remains a challenge. For example, Ramesh et al. (2022) observe that DALL·E-2 "struggles at producing coherent text," and the latest release of Stable Diffusion lists "cannot render legible text" as a known limitation (https://huggingface.co/stabilityai/stable-diffusion-2-1).

In this paper, we seek to understand and improve the ability of image generation models to render high-quality visual text. To do so, we first investigate the spelling ability of text encoders in isolation. We find that despite their popularity, character-blind text encoders, which receive no direct signal as to the character-level makeup of their inputs, have limited spelling ability. Building on Itzhak and Levy (2022), we test the spelling ability of text encoders across scales, architectures, input representations, languages, and tuning methods. We document for the first time the miraculous ability of character-blind models to induce robust spelling knowledge (>99% accuracy) through web pretraining, but show that this ability does not generalize well beyond English, and is only achieved at scales over 100B parameters, making it infeasible for most applications. Character-aware text encoders, on the other hand, achieve robust spelling ability at far smaller scales.
Applying these findings to image generation, we train a range of character-aware text-to-image models and demonstrate that they significantly outperform character-blind models on existing and novel evaluations of text rendering. For models that are purely character-level, this improved text rendering comes at a cost: decreased image-text alignment for prompts that don't involve visual text. To alleviate this, we propose combining character-level and token-level input representations, and find that this delivers the best of both worlds.
Our main contributions are to:
1. Measure the spelling ability of a range of text encoders, pulling apart the effects of scale, character-awareness, and multilinguality, using a new benchmark: WikiSpell.
2. Present DrawText, the first detailed benchmark of visual text rendering for text-to-image models.
3. Improve the state of the art in text rendering ability of image generation models through the use of character-aware text encoders.

The spelling miracle
Language models can be categorized as to whether they have direct access to the characters making up their text input ("character-aware") or do not ("character-blind"). Many early neural language models operated directly on characters, with no notion of multi-character "tokens" (Sutskever et al., 2011; Graves, 2013). Later models moved to vocabulary-based tokenization, with some, like ELMo (Peters et al., 2018), retaining character-awareness, and others, like BERT (Devlin et al., 2019), abandoning it in favor of more efficient pretraining. At present, most widely used language models are character-blind, relying on data-driven subword segmentation algorithms like Byte Pair Encoding (BPE) (Gage, 1994; Sennrich et al., 2016) to induce a vocabulary of subword pieces. While these methods back off gracefully to character-level representations for sufficiently uncommon sequences, they compress common character sequences into unbreakable units by design. This is illustrated in Figure 2.

Recent work on "token-free" modeling has pointed to advantages of character-aware input representations. Xue et al. (2022) show that ByT5, a character-aware multilingual language model trained directly on UTF-8 bytes, outperforms parameter-matched character-blind models on tasks related to spelling and pronunciation. While operating at the byte or character level comes at the cost of training and inference speed, additional work suggests that this cost can be overcome through downsampling (Clark et al., 2022; Tay et al., 2021). See Mielke et al. (2021) for a recent overview of tokenization methods and character-awareness.
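For illustration, the contrast between the two input views can be seen directly in the tokenizers. The sketch below assumes the Hugging Face transformers library and public base-size checkpoints; the printed outputs are indicative rather than guaranteed, since the exact segmentation depends on the learned vocabulary.

    from transformers import AutoTokenizer

    # Character-blind: a SentencePiece subword vocabulary may map a common
    # word to a single opaque token id, hiding its spelling from the model.
    t5_tok = AutoTokenizer.from_pretrained("t5-base")
    print(t5_tok.tokenize("elephant"))  # likely ['▁elephant'], one unbreakable unit

    # Character-aware: ByT5 consumes raw UTF-8 bytes, so every character
    # is directly visible to the model.
    byt5_tok = AutoTokenizer.from_pretrained("google/byt5-base")
    print(byt5_tok("elephant")["input_ids"])  # one id per byte, plus EOS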
Surprisingly, despite lacking direct access to a token's spelling, character-blind models are, to varying degrees, able to infer the character-level makeup of their tokens. Itzhak and Levy (2022) observe that, after fine-tuning for spelling, RoBERTa and GPT-2 achieve 32% and 33% accuracy, respectively, at spelling held-out tokens. Kaushal and Mahowald (2022) confirm this ability and probe it further; however, it remains unclear where in pretraining this knowledge comes from, and how to improve it. For example, should we expect larger character-blind models to reach 100% spelling accuracy across all tokens in their vocabulary?
In §3 we find that, with sufficient scale, it is possible for character-blind models to achieve near-perfect spelling accuracy. We dub this phenomenon the "spelling miracle", to emphasize the difficulty of inferring a token's spelling from its distribution alone. At the same time, we observe that character-blind text encoders of the sizes used in practice for image generation lack core spelling knowledge.
With this in mind, it is unsurprising that today's image generation models struggle to translate input tokens into rendered character sequences. These models' text encoders are all character-blind, with Stable Diffusion, DALL·E, DALL·E-2, Imagen, Parti and eDiff-I all adopting variants of BPE tokenizers (Rombach et al., 2021; Ramesh et al., 2021, 2022; Saharia et al., 2022; Yu et al., 2022; Balaji et al., 2022). For image-text models, another key source of knowledge is supervised image-caption data. Even if its text encoder is character-blind, could a model learn to spell by observing the makeup of words within images? While possible, we suspect this is an inefficient paradigm for learning, as each token would need to be learned separately, and would need to appear within an image-caption pair seen in training. In §5 we find that, indeed, this "late-stage" learning of spelling is inferior to using a pretrained character-aware text encoder.

Measuring text encoder spelling ability
Since text-to-image generation models rely on text encoders to produce the representations for decoding, we first explore the ability of text encoders in isolation, using a text-only spelling evaluation task.

The WikiSpell benchmark
We create the WikiSpell benchmark by sampling words from Wiktionary. For each example in the dataset, the input to the model is a single word, and the expected output is its spelling, generated by inserting spaces between each Unicode character:

    elephant → e l e p h a n t

Since we are interested in examining the relationship between a word's frequency and a model's ability to spell it, we group the Wiktionary words into buckets based on how frequently they occur in the mC4 corpus (Xue et al., 2021). We then create a test set (as well as an analogous development set) from each bucket by sampling 1k words uniformly from it. The five (non-overlapping) buckets we use are: the top 1% most frequent words, the 1-10% most frequent, 10-20%, 20-30%, and the bottom 50% (which includes words never seen in the corpus). Finally, we build a training set of 10,000 words by combining two parts: 5,000 words sampled uniformly from the bottom 50% bucket (the rarest words), and another 5,000 sampled proportional to their frequencies in mC4 (thus biasing this half of the training set toward frequent words). We exclude from the training set any words selected for a development or test set, so evaluation is always on held-out words. We repeat this process for each of the languages we evaluate on; a sketch of the construction follows.
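The following is a minimal sketch of this construction, assuming a precomputed word-to-count mapping from mC4; the function name and the with-replacement frequency sampling are simplifications of ours, not the exact pipeline.

    import random

    def build_wikispell_splits(freqs, n_eval=1000, n_train=10_000, seed=0):
        """freqs: dict mapping each Wiktionary word to its mC4 count.

        Returns per-bucket test sets and the mixed training set described
        above; development sets are built analogously (omitted here).
        """
        rng = random.Random(seed)
        words = sorted(freqs, key=freqs.get, reverse=True)  # most frequent first
        n = len(words)
        buckets = {
            "top 1%":     words[:int(0.01 * n)],
            "1-10%":      words[int(0.01 * n):int(0.10 * n)],
            "10-20%":     words[int(0.10 * n):int(0.20 * n)],
            "20-30%":     words[int(0.20 * n):int(0.30 * n)],
            "bottom 50%": words[int(0.50 * n):],
        }
        test = {name: rng.sample(b, n_eval) for name, b in buckets.items()}
        held_out = {w for sample in test.values() for w in sample}
        # Half the training set: uniform over the rarest bucket.
        rare_pool = [w for w in buckets["bottom 50%"] if w not in held_out]
        train = rng.sample(rare_pool, n_train // 2)
        # Other half: frequency-weighted over all remaining words
        # (sampled with replacement here, for simplicity).
        pool = [w for w in words if w not in held_out]
        train += rng.choices(pool, weights=[freqs[w] for w in pool], k=n_train // 2)
        return buckets, test, train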
In addition to English, we evaluate on six other languages, selected to cover a diversity of properties that affect the ability of models to learn spellings: Arabic, written in the Arabic alphabet, has non-concatenative morphology; Chinese is written in the Simplified and Traditional Chinese scripts, which are logographic and do not use whitespace to separate words; Finnish, written in the Latin alphabet, has rich inflectional and derivational suffixes, and word stems often change when suffixes are attached; Korean's writing system, Hangul, has a huge number of characters, since alphabetic features are arranged into syllabic blocks that Unicode represents as single characters; Russian, written in the Cyrillic alphabet, has substantial fusional morphology, and uses inflection for case-marking and agreement; and Thai, written in the alphabetic Thai script, is an analytic language, but does not use whitespace between words.
The WikiSpell benchmark is similar to SpellingBee, introduced by Itzhak and Levy (2022), but differs in a few key ways. First, SpellingBee is designed to probe a model's embedding matrix: given the embedding vector corresponding to an element of the model's vocabulary, SpellingBee seeks to output the sequence of characters that spell that vocabulary element. This means that an input to SpellingBee will only ever be a single token, and it does not evaluate the ability to spell words that the model represents using multiple subwords. Second, because of how subword vocabularies are trained, model vocabularies only contain high-frequency words, and thus all of the inputs to SpellingBee will be high-frequency. Finally, because SpellingBee's inputs must be drawn from a model's vocabulary, training and evaluation data must be tailored to a specific model, and the same datasets cannot be used across all models. In contrast, WikiSpell is model-agnostic, covers single- to many-token words, and covers high- to low-frequency words.

Text generation experiments
We use the WikiSpell benchmark to evaluate multiple pretrained text-only models across a variety of scales. In particular, we experiment with the following models: T5 (Raffel et al., 2020), a character-blind encoder-decoder model pretrained on English data; mT5 (Xue et al., 2021), which is similar to T5, but pretrained on >100 languages; ByT5 (Xue et al., 2022), a character-aware version of mT5 that operates directly on UTF-8 byte sequences; and PaLM (Chowdhery et al., 2022), a decoder-only model of much larger scale, pretrained predominantly on English. Results for English-only evaluation are shown in Table 1, and for multilingual evaluation in Table 2.
The first notable finding is that the character-blind models T5 and mT5 perform much worse on the bucket containing the top 1% most frequent words. This result may seem counter-intuitive, since models typically perform best on examples that appear frequently in the data. But due to the way subword vocabularies are trained, frequent words are typically represented as a single atomic token (or a small number of tokens), and indeed this is the case: 87% of words in the English Top 1% bucket are represented as a single subword token by T5's vocabulary. Scores are a bit higher in the middle-frequency buckets, where words are typically broken into a few commonly occurring subword tokens, and lower again in the lowest-frequency bucket, where even the subword tokens may be less frequent. The low spelling accuracy thus indicates that T5's encoder does not retain sufficient information about the spelling of the subwords in its vocabulary.
Secondly, our experiments show that for character-blind models, scale is a significant factor in spelling ability. Both T5 and mT5 get progressively better as scale increases, but even at XXL scale, these models do not exhibit particularly strong spelling abilities; for example, T5-XXL's accuracy on common English words is only 66%. It is only when character-blind models reach PaLM's scale that we start to see near-perfect spelling ability: the 540B-parameter PaLM model achieves accuracies >99% across all frequency buckets in English, despite seeing only 20 examples in its prompt (as opposed to the 1,000 fine-tuning examples shown to T5). However, PaLM performs less well on other languages, likely due to there being considerably less pretraining data for them.
Our experiments with ByT5 show that character-aware models, on the other hand, exhibit far greater spelling ability. ByT5's performance at Base and Large sizes is only slightly behind XL and XXL (though still at least in the mid-90% range), and a word's frequency does not seem to have much effect on ByT5's ability to spell it. These results far exceed those of (m)T5, are in fact comparable to the English performance of PaLM, which has >100× more parameters, and exceed PaLM's performance on other languages. These findings indicate that substantially more character-level information is retained by the ByT5 encoder, and in such a way that it can be retrieved from those frozen parameters as needed for the decoding task.
We also conduct experiments in which we finetune the full model instead of keeping the encoder frozen (also in Table 1). Here we see that when ByT5's encoder is finetuned for the task, performance reaches roughly 100% at all scales and for all frequency buckets. For T5, the effect of finetuning the encoder is more mixed: for less frequent words, it helps a lot (e.g., T5-XXL goes from 65% to 90% on the Bottom 50% bucket), but for common words, it has almost no effect (T5-XXL goes from 66% to only 68% on the Top 1% bucket). This tells us that for words that get broken into smaller pieces, where those pieces will likely appear as subwords of training examples, the model is able to memorize the spelling information provided during fine-tuning; but for words represented by a single subword token, fine-tuning provides no direct information about that word's spelling, since, by definition, that single subword token will not appear in the fine-tuning dataset.

The DrawText benchmark
Evaluating text-to-image models has been an ongoing topic of research, with the development of standard benchmarks from COCO (Lin et al., 2014) to DrawBench (Saharia et al., 2022), and metrics including FID (Heusel et al., 2017), CLIP score (Hessel et al., 2021), and human preferences (Saharia et al., 2022). However, there has been a lack of work on evaluating text rendering and spelling. To that end, we present a new benchmark, DrawText, designed to comprehensively measure the text rendering quality of text-to-image models. The DrawText benchmark consists of two parts, which measure different axes of model capability: 1) DrawText Spelling, which evaluates plain word rendering over a sizable collection of English words; and 2) DrawText Creative, which evaluates text rendering with visual effects.

DrawText Spelling
To measure the spelling ability of image generation models in a controlled and automatable fashion, we construct 500 prompts by sampling 100 words from each of the English WikiSpell frequency buckets (see §3.1), and plugging them into a standard template: A sign with the word " " written on it. For each prompt, we sample 4 images from the candidate model, and assess them using both human ratings and optical character recognition (OCR)-based metrics.
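Concretely, prompt construction amounts to the following sketch; the function name is ours, and buckets is the structure from the WikiSpell sketch in §3.1.

    import random

    TEMPLATE = 'A sign with the word "{}" written on it.'

    def build_spelling_prompts(buckets, n_per_bucket=100, seed=0):
        """buckets: the English WikiSpell frequency buckets from §3.1."""
        rng = random.Random(seed)
        words = [w for b in buckets.values() for w in rng.sample(b, n_per_bucket)]
        return {w: TEMPLATE.format(w) for w in words}  # 5 x 100 = 500 prompts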
For the OCR evaluation, we use the Google Cloud Vision API, which takes in an image and returns all text it identifies, along with bounding boxes indicating locations. The DrawText Spelling prompt tends to generate a prominently positioned sign with text, which is relatively simple for the off-the-shelf OCR system to identify; if the system returns multiple bounding boxes, we use only the top-most one. Additionally, since text is sometimes rendered across multiple lines, we post-process the OCR output by removing newline characters that appear within a single bounding box. Finally, since text on real signs is often written in all capitals, and models often do the same regardless of how the word is written in the prompt, we ignore case when computing spelling accuracy.
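One plausible implementation of this judging step, using the google-cloud-vision client library, is sketched below; the function name and the exact choice of "top-most box" tie-breaking are our assumptions, and quota and error handling are omitted.

    from google.cloud import vision

    def spelled_correctly(image_bytes: bytes, target: str) -> bool:
        """Judge one sampled image via Cloud Vision OCR, applying the
        post-processing described above (sketch only)."""
        client = vision.ImageAnnotatorClient()
        response = client.text_detection(image=vision.Image(content=image_bytes))
        annotations = response.text_annotations
        if not annotations:
            return False  # no text detected anywhere in the image
        # annotations[0] concatenates all detected text; the rest are
        # individual boxes. Keep only the top-most box, per the benchmark.
        boxes = annotations[1:] or annotations[:1]
        top = min(boxes, key=lambda a: min(v.y for v in a.bounding_poly.vertices))
        text = top.description.replace("\n", "")  # join multi-line renderings
        return text.lower() == target.lower()     # full-string, case-insensitive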

DrawText Creative
Visual text is not limited to mundane examples like street signs. Text can appear in many forms: scribbled, painted, carved, sculpted, and so on. If image generation models support flexible and accurate text rendering, designers will be able to use these models to develop creative fonts, logos, layouts, and more.
To test the ability of image generation models to support these use cases, we worked with a professional graphic designer to construct 175 diverse prompts that require rendering text in a range of creative styles and settings. The prompts vary in how much text is specified, ranging from a single letter to an entire sentence.
We share these prompts in Appendix C, with the expectation that they will help the community work towards improving text rendering. Many of the prompts are beyond the abilities of current models, with state-of-the-art models exhibiting misspelled, dropped, or repeated words, as seen in Figure 3.

Image generation experiments
In this section, we evaluate the spelling ability of text-to-image generative models using the proposed DrawText benchmark. State-of-the-art text-to-image generative models consist of a text encoder plus a cascade of either diffusion models (Saharia et al., 2022) or autoregressive models (Yu et al., 2022) that map the encoded text representations to realistic images. In §3 we saw that character-aware text encoders greatly outperform character-blind models on spelling in a text-only setting; in this section, we investigate whether making the text encoder character-aware improves the text rendering ability of text-to-image models.

Models
For an apples-to-apples comparison, we train two character-blind and three character-aware image generation models. Our training closely follows the procedure of Saharia et al. (2022), with the following modifications. First, our models train for 500,000 steps, which is 5.6× fewer steps than Imagen. Second, we only train the initial 64 × 64 model, as text rendering ability can already be assessed at this scale. This allows us to forgo the training of super-resolution models.
Third, rather than a mixture of datasets, we train exclusively on the publicly available Laion-400M (Schuhmann et al., 2021). This improves reproducibility and also increases the amount of visual text seen during training. Inspecting a random sample of 100 images, we found that a relatively high proportion (around 71%) of Laion images contain text, and many (around 60%) exhibit correspondence between caption text and visual text.
Fourth, to prevent models from clipping text, we train on uncropped images with arbitrary aspect ratios. In contrast with the widely used strategy of cropping a square from the center of the image, we maintain the image's true aspect ratio by padding with black borders. The model receives an additional binary mask input indicating the padding, as in the sketch below.
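A minimal sketch of this preprocessing follows (PIL/NumPy); note that centering the image within the padded canvas is our assumption, as the exact placement is not specified above.

    import numpy as np
    from PIL import Image

    def pad_to_square(img: Image.Image, size: int = 64):
        """Resize so the longer side fits, then pad with black borders,
        returning the padded image and a binary mask of real pixels."""
        w, h = img.size
        scale = size / max(w, h)
        img = img.resize((max(1, round(w * scale)), max(1, round(h * scale))))
        arr = np.asarray(img.convert("RGB"))
        canvas = np.zeros((size, size, 3), dtype=np.uint8)  # black borders
        mask = np.zeros((size, size), dtype=np.uint8)       # 1 = real image
        h2, w2 = arr.shape[:2]
        top, left = (size - h2) // 2, (size - w2) // 2
        canvas[top:top + h2, left:left + w2] = arr
        mask[top:top + h2, left:left + w2] = 1
        return canvas, mask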
To test the effects of text encoder size and character-awareness, we vary the pretrained text encoder as follows:

T5-XL and T5-XXL: Following Saharia et al. (2022), we use the (character-blind) pretrained T5 text encoders of Raffel et al. (2020). The encoder sizes are 1.2B (XL) and 4.6B (XXL). Note that T5-XXL is the same encoder used in both Imagen and the recent eDiff-I (Balaji et al., 2022).

ByT5-XL and ByT5-XXL: We use the pretrained ByT5 encoders of Xue et al. (2022), with encoder sizes 2.6B (XL) and 9.0B (XXL). These differ from T5 in several regards. First, ByT5 models read and write UTF-8 bytes rather than tokens from a vocabulary, so they are fully character-aware. Second, ByT5 is multilingual, trained on the mC4 corpus of over 100 languages. Third, ByT5 pretrains with sequence length 1024, twice that of T5. When encoding text as input to the image generation module, we use a sequence length of 256 bytes, compared to 64 tokens for the T5 models.
Concat(T5-XXL, ByT5-Small): We use, as the text encoding, a concatenation of the encodings from T5-XXL and a small version of ByT5. ByT5-Small (220M) represents a lightweight addition to the Saharia et al. (2022) model in terms of overall compute and model size (only a 4.8% increase in encoder size), but it makes the model character-aware.
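The sketch below illustrates one way to realize this hybrid encoding, using small public checkpoints as stand-ins; the learned projection that reconciles the two hidden sizes before concatenating along the sequence axis is our assumption, as the exact mechanism is not detailed here.

    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    # Small public checkpoints as stand-ins for the XXL/Small encoders.
    t5_tok = AutoTokenizer.from_pretrained("t5-base")
    byt5_tok = AutoTokenizer.from_pretrained("google/byt5-small")
    t5 = T5EncoderModel.from_pretrained("t5-base").eval()
    byt5 = T5EncoderModel.from_pretrained("google/byt5-small").eval()

    # Assumption: a learned linear projection maps ByT5's hidden size to
    # T5's, so the two sequences can be concatenated along the length axis.
    proj = torch.nn.Linear(byt5.config.d_model, t5.config.d_model)

    @torch.no_grad()
    def encode(prompt: str) -> torch.Tensor:
        a = t5(**t5_tok(prompt, max_length=64, padding="max_length",
                        truncation=True, return_tensors="pt")).last_hidden_state
        b = byt5(**byt5_tok(prompt, max_length=256, padding="max_length",
                            truncation=True, return_tensors="pt")).last_hidden_state
        return torch.cat([a, proj(b)], dim=1)  # shape (1, 64 + 256, d_model)

Concatenating along the sequence axis (rather than the feature axis) lets the image generation module attend to token-level and byte-level positions independently.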
Imagen Aspect-Ratio (Imagen-AR): To test the benefit of training on uncropped images, we fine-tune the Imagen model of Saharia et al. (2022) for an additional 380,000 steps (3.2M steps total) on uncropped images with the original aspect ratio preserved, as described above.
Beyond these custom models, we benchmark Stable Diffusion version 1.5 (Rombach et al., 2021), Imagen (Saharia et al., 2022), and Parti (Yu et al., 2022), all of which use character-blind subword-level text encoders. Among these, Imagen is most similar to our experimental models, using the same T5-XXL encoder, but trained much longer and on larger-scale data.

DrawText Spelling Results
Figure 4 shows our DrawText Spelling results across 9 models, after sampling 2,000 images per model, and running each through OCR. Accuracy is computed on the full string (i.e., no credit is given for partial matches).
Across all word frequencies, character-aware models (ByT5 and Concat) outperform the rest, with 15+ point accuracy gains over Imagen-AR on the most frequent words, and 30+ point gains on the least frequent words. This is especially remarkable given that Imagen-AR trained for 6.6× longer.
Our T5 models provide a more controlled comparison against the character-aware models, as they differ only in the choice of text encoder, training on the same dataset for the same number of steps. Here, the gains are even larger, with 25+ point gains on the most frequent words and 30+ point gains on the least frequent. Notably, these gains persist even for the smaller ByT5-XL model, whose encoder is 43% smaller than T5-XXL.
To assess the approximate rate of false positives and false negatives due to OCR errors, we sample 32 examples labeled correct and 32 labeled incorrect for each of T5-XXL and ByT5-XXL, and perform a manual human validation. In our sample, we find no false positives; that is, when OCR detects the correct word, it is always correct. However, we do observe false negatives for both models. These include cases where the text is not detected (e.g., due to being too small or too blurry), or where OCR misreads or drops a character, or confuses punctuation with arbitrary lines or dots in the image. For ByT5-XXL, we find that 34% of examples labeled incorrect by OCR are actually correct; for T5-XXL, this error rate is lower, at 9%. This asymmetry suggests that the benefit of character-aware modeling may be even greater than implied by our results in Figure 4.
To gain a better qualitative understanding of different models' failure modes, we manually inspect the generations of our T5 and ByT5 models. Table 3 illustrates common error types.
Several categories of error are only observed in T5 models, suggesting that they stem from the encoder's lack of core spelling knowledge. In severe errors, the model is off by more than just a few characters. In semantic errors, the model makes a plausible morpheme substitution, as in demonstrated → demonstrafied. In homophone errors, the model produces an incorrect spelling that could be pronounced similarly to the target word. This suggests that some of the T5 encoders' "miraculous" spelling ability may derive from phonetic pronunciation guides found online. In add glyph errors, the model inserts a letter that was absent from the target, again reflecting the model's uncertainty about a token's internal character makeup.
Other error categories are found across all model types; these include dropped, repeated, merged, or misshapen glyphs. Given that our ByT5 encoders provide a robust spelling signal (see §3.2), we understand these errors to be "layout issues", where the image generation module has trouble shaping and positioning realistic glyphs within the image.

Figure 4: Accuracy of 9 image generation models on our DrawText Spelling benchmark. Character-aware models (ByT5 and Concat) outperform others regardless of size, and particularly on rare words. Imagen-AR shows the benefit of avoiding cropping, but still underperforms character-aware models, despite training 6.6× longer.

Another stark difference between our models lies in whether they consistently misspell a given word across multiple samples. As seen in Figure 5, there are many words that our T5 models misspell no matter how many samples are drawn. Again, we believe this indicates missing knowledge in the text encoder. By contrast, our ByT5 models are more likely to make sporadic errors, as seen in Figure 6. We quantify this observation in Figure 7 by measuring the rates at which the model is consistently right (4/4) or wrong (0/4) across all four image samples. On common words in particular (Top 1%), we see a sharp contrast in that ByT5 models are never consistently wrong, while T5 models are consistently wrong on 10% or more of words.
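The consistency measurement reduces to a small helper over the per-sample OCR judgments; the function name and input layout are our own.

    def consistency_rates(ocr_correct):
        """ocr_correct: dict mapping each word to its list of 4 per-sample
        booleans from the OCR judge. Returns (always right, always wrong)."""
        n = len(ocr_correct)
        right = sum(all(v) for v in ocr_correct.values()) / n
        wrong = sum(not any(v) for v in ocr_correct.values()) / n
        return right, wrong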

DrawText Creative Results
To test our models in a more realistic user-facing setting, we sample 8 images from each of our T5 and ByT5 models on our 175 DrawText Creative prompts in Appendix C. These prompts are more diverse and challenging, with the majority targeting three or more words of rendered text.
Focusing on text rendering ability, we find once again that character-aware models have a clear advantage. Figure 8 shows representative samples on two prompts where T5-XXL consistently misspells one or more words; see Figures 12 and 13 for non-cherrypicked samples. (We note that our models' overall image quality and alignment fall short of a state-of-the-art model like Imagen (see Figure 3); this is expected, given that our models train exclusively on the lightly curated Laion-400M.)

On prompts targeting longer (e.g., sentence-length) text spans, all our models struggle, as seen in Figure 14. We suspect that the problem of arranging words plausibly in a fixed frame is particularly challenging for diffusion models, which render all positions in parallel, and that progress may require larger models, longer training, and/or improvements to the image generation module. Nevertheless, we observe that character-aware text encoders provide a clear lift on these prompts, reducing the misspellings of words like refrain, arguing, and chimpanzees.

DrawBench Results
We have shown that character-aware text encoders excel at spelling, in both the text (§3) and visual (§5) domains. But does this ability come at a cost? Can these models maintain the high image quality and strong text-image alignment of character-blind models? To shed light on this question, we run several side-by-side comparisons using the DrawBench evaluation of Saharia et al. (2022), which asks human raters to compare two models' generations of 8 images each across 200 prompts covering 11 thematic categories. We follow the procedure described in Saharia et al. (2022) closely, aggregating scores across 25 raters.

Figure 9 shows DrawBench results of three side-by-side comparisons of character-aware models vs. T5-XXL. While image quality ("fidelity") is similar across the board, we find that purely character-level models (ByT5-XL and ByT5-XXL) score worse on image-text alignment, with raters preferring T5-XXL on 60% of prompts. By contrast, our Concat(T5-XXL, ByT5-Small) model closes this alignment gap to within error bars. Thus, this "hybrid" character-aware model is able to greatly improve text rendering (Figure 4) without significantly hurting performance elsewhere.
To understand the alignment scores in more detail, we report per-category preference scores in Figure 10. In line with our DrawText Spelling results, the character-aware models are always preferred in the text category (21 prompts testing the ability to render 7 short phrases in 3 visual styles). The ByT5 models are also preferred in the count category, which tests prompts like Four dogs on the street. However, they are dispreferred in nearly all other cases, and perform particularly poorly in the color category. Through manual inspection, we find that in this category, the ByT5 models are more prone to ignore information in the prompt, for example leaving out a mentioned object, or choosing a canonical color over a requested one (e.g., a yellow banana instead of a red one).
One possible explanation for this behavior is that we did not tune the guidance weight parameter used at inference time (Saharia et al., 2022), using a fixed value of 30 throughout. Increasing this parameter is known to boost image-text alignment, but at the cost of diversity. It may be that character-level models benefit from higher guidance values than token-based models.
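For reference, the guidance weight w enters sampling through classifier-free guidance (Ho and Salimans, 2022); in the parameterization used by Saharia et al. (2022), the conditional denoising prediction is adjusted as

    % w = 1 recovers the unguided model; our experiments fix w = 30.
    \tilde{\epsilon}_\theta(z_t, c) = w\,\epsilon_\theta(z_t, c) + (1 - w)\,\epsilon_\theta(z_t)

where z_t is the noisy latent and c the text conditioning.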
Another possibility is that the ByT5 models have a shallower understanding of English due to their multilingual nature, as ByT5 was exposed to roughly 70× less English than T5 during pretraining. Given this difference, we should also expect to see corresponding gains on non-English languages, which we turn to now.

Multilingual Results
As ByT5 is a multilingual model covering 100+ languages, we are interested to see if image generation models built on ByT5 deliver improved performance over T5 on non-English languages.
While the text encoder itself is multilingual, it is not obvious whether this is sufficient to produce a multilingual image generation model. To test for multilingual understanding, we translate two English prompts into 11 other languages using Google Translate, and feed the outputs to our models. As can be seen in the first two rows of Figure 11, our T5-XXL model demonstrates basic understanding of five high-resource European languages (German, French, Spanish, Portuguese, Russian). However, in a lower-resource language (Greek) or in non-European languages (Hindi, Arabic, Chinese, Japanese, Korean), T5 appears to ignore the caption completely, rendering visual nonsense text in a variety of scripts.
By comparison, our ByT5-XXL model exhibits understanding across all 11 languages. Given its limited training on multilingual captions, we interpret this ability as due to the pretrained ByT5 encoder's alignment of representations across languages. If the encoder already embeds similar prompts into a shared space that factors out the contribution of language, then the image generation model should be able to learn from just a handful of examples how to map any language seen in pretraining into the space of images.
If this explanation is correct, it also suggests that rendering text in different scripts will require more than just a multilingual encoder. To learn the glyph shapes, variants, and fonts used for a given script, we should expect to need to train models on a large source of visual text in that script. Indeed, in the third-row generations of Figure 11, we see that neither of our models can map prompt text onto visual text in non-Latin scripts. While our ByT5 model captures the intent to draw a sign across all languages, it is unable to render the words for dog in Greek, Russian, Chinese, and so on, presumably because it has had little visual exposure to the glyphs making up these words.

As a side note, we observe in several examples that the prompt language can bias the model towards culturally relevant visual interpretations. For example, the Chinese prompt for A photo of an old house (一张老房子的照片) produces a house with a curved roof. It would be interesting to further explore the extent of these biases and the degree to which they can be overcome where unwanted.

Conclusion
In this paper, we set out to better understand what is needed for image generation models to reliably render well-formed visual text. Using our novel WikiSpell and DrawText benchmarks, we were able to precisely quantify the effects of character-awareness and other design choices on spelling ability in both the text and visual domains.
We found that character-aware text encoders provide large gains on spelling, and when used within an image generation model, these gains translate directly into improved visual text rendering. However, using exclusively character-level representations deteriorated overall text-image alignment, at least when evaluating our multilingual ByT5 text encoder on English prompts with an untuned guidance weight. To resolve this, we found that a hybrid model combining token-level and character-level signals provides the best of both worlds: dramatically improving visual text without significantly affecting overall alignment.
While we saw substantial improvements on DrawText Spelling accuracy (75% → 94% on common words and 47% → 83% on rare words), some failure modes remain unaddressed. Even our strongest models were observed to occasionally drop, repeat, or merge letters within a word, or words within a phrase. Our results strongly suggest that resolving these issues will require orthogonal improvements outside the text encoder, specifically changes to the image generation module.

As a secondary finding, we demonstrated for the first time that, with sufficient scale, even models lacking a direct character-level view of their inputs can infer robust spelling information through knowledge gained via web pretraining: "the spelling miracle". While remarkable, this finding is less immediately practical, as it requires models over 100B parameters, and even these did not generalize well beyond English in our experiments.
One limitation is that we focused on image generation models that leverage frozen pretrained text encoders. This enabled straightforward experimentation by swapping encoders and retraining the image generation module. However, it remains to be seen whether our results extend to settings where the text encoder is trained along with the rest of the model, as in Yu et al. (2022).

A Additional WikiSpell details
• Example Python 3 code for transforming a word into its spelling:

    def to_spelling(word: str) -> str:
        return " ".join(word)

• Since we want each entry to be a single word, we exclude entries that contain any (Unicode) whitespace, that are entirely punctuation/symbols (i.e., all characters are from Unicode categories P and/or S), that are longer than 30 characters, or that have the "part-of-speech" Proverb.
• For efficiency, word frequencies are computed on subsets of the full mC4 corpus. For languages other than English, this is a sample of 1M documents from that language's section of mC4. For English, since it has such a long tail of words in Wiktionary, we use the first 140M documents in mC4's English section.
• For Arabic, English, Finnish, Korean, and Russian, word counting is performed by splitting document texts on the following delimiters: ?!/:;,\"&()[]{}<>ˋ, plus any Unicode whitespace. For Chinese and Thai, since they do not use whitespace to separate words, we instead count the number of documents in which the word appears as a substring. A combined sketch of the entry filtering and word counting appears below.
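The following sketch implements these two steps; the function names are ours, part-of-speech filtering is omitted, and we interpret the final delimiter above as a backtick.

    import re
    import unicodedata

    # Delimiters listed above, plus Unicode whitespace (\s).
    DELIMS = re.compile(r'[?!/:;,"&()\[\]{}<>`\s]+')

    def is_valid_entry(word: str) -> bool:
        """Wiktionary filtering rules (part-of-speech filtering omitted)."""
        if len(word) > 30 or any(ch.isspace() for ch in word):
            return False
        # Drop entries made entirely of punctuation (P*) and/or symbols (S*).
        return not all(unicodedata.category(ch)[0] in "PS" for ch in word)

    def count_frequencies(docs, vocab, whitespace_delimited=True):
        """Approximate mC4 frequencies over a sample of documents."""
        counts = dict.fromkeys(vocab, 0)
        for doc in docs:
            if whitespace_delimited:  # Arabic, English, Finnish, Korean, Russian
                for token in DELIMS.split(doc):
                    if token in counts:
                        counts[token] += 1
            else:                     # Chinese, Thai: document-level substring counts
                for word in vocab:
                    if word in doc:
                        counts[word] += 1
        return counts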

B Additional DrawText Creative Samples
We show additional samples on DrawText Creative prompts in Figures 12, 13 and 14.

C DrawText Creative Prompts
We present below 175 creative prompts targeting rendered text of various lengths: one letter (10), one word (50), two words (25), and three or more words (90).

Prompts used in Figure 3

1. Studio shot of book shelf in the shape of letter G, museum quality, white background.
2. Drops of pastel rainbow colored paint exploding under water in letters "color" shape, pastel rainbow gradient background
3. 3-d Letters "DILL" made from dill, studio shot, green background, centered on a page
4. Word "coffee" made from coffee beans, studio shot.
5. studio shot multicolored fur in the shape of word "hello", in a furry frame, white background, centered
6. Photo of a robot lecturer writing the words "Representation Learning" in cursive on a blackboard, with math formulas and diagrams.
7. studio close-up shot of an antique book with 'knowledge is power' painted in gold on the cover in thick flowing brushed calligraphy
8. portrait of a parrot is holding a sign with text "no parrots were harmed in the making of this presentation"

DrawText Creative prompts: 1 letter

1. Studio shot of book shelf in the shape of letter G, museum quality, white background.
2. letter "c" made from cactus, high quality photo 3. Spirograph shape letter M, rainbow lines, white background.
4. Closeup shot of light magenta, blue and paint brushstrokes of very wide translucent overlapping plastic in the shape of letter F, over white background.
5. The lowercase letter "b" made out of fire.
6. Slopy minimal continued line pencil hand drawing of letter Z, white background.

Figure 12: Non-cherrypicked samples from our T5-XXL (top) and ByT5-XXL (bottom) models. The character-aware ByT5 model reliably spells the target word correctly, with only minor issues around letter shapes or letter merging. Over 100 samples, we found the character-blind T5 model never produced the target spelling. Prompt: The word "exquisite" written in modern calligraphy.

Figure 13: Non-cherrypicked samples from our T5-XXL (top) and ByT5-XXL (bottom) models. The character-blind T5 model makes more frequent and more severe errors, including often hallucinating an s at the end of the irregular plural snowmen. Prompt: The cover for the album 'Elusive Interludes' by the band The Melting Snowmen. We filter images with no legible text for better comparison, removing a small minority of samples for both models.

Figure 14: Non-cherrypicked samples from our T5-XXL (top) and ByT5-XXL (bottom) models. Both models exhibit layout errors, including dropped, repeated, or merged glyphs and words. The T5 model suffers additionally from a lack of core spelling knowledge, misspelling refrain, arguing, and chimpanzees on the majority of uses. The ByT5 model is able to spell each of these words correctly in most cases. Prompt: A sign that says "Please refrain from arguing with the chimpanzees".
7. a tower with a huge "w" on the side, from the perspective of a person standing at the base of the tower
8. 3-d letter R made from thin lines connected with dots, white background.
9. Muted pastel magenta colored paint swirled in white paint in the shape of letter X, globular paint in liquid.
10. Minimal sculpture of letter W made from light metallic iridescent chrome thin wire, 3-D render, isometric perspective, ultra-detailed, dark background.
DrawText Creative prompts: 1 word

1. Drops of pastel rainbow colored paint exploding under water in letters "color" shape, pastel rainbow gradient background
2. 3-d Letters "DILL" made from dill, studio shot, green background, centered on a page
3. Word "coffee" made from coffee beans, studio shot.
4. studio shot multicolored fur in the shape of word "hello", in a furry frame, white background, centered
5. Wide lens shot, chunky, organic, colorful, letters "colorful" made from many furry spheres of different sizes, 3-d rendering, centered, studio shot, middle of square canvas
6. A logo for the company EcoGrow, where the letters look like plants.
7. a green-colored luxury car with a "green" sticker in the back window
8. A blackboard with the word "multiplication" written in flowing cursive.
10. transparent water drops exploding under water in the shape of word "water", under water
11. a drawing of a badger made of mushrooms, with the word "mushroom" written above in glowing letters
12. a 17th century french baroque painting of a huge female lion, with the word "meow" written in a speech bubble coming from her mouth
13. a fun and colorful illustration of a waterfall, with the word "waterfall" in the style of a children's book
14. Letters "VOLUME" fully made from rainbow smoke, black background, centered, sceensaver.
15. dslr, 3-d word "rainbow" with rainbow fur, white background
16. a painting of a field of daisies, with the word "danger" written on them in red spray paint
17. a bottle of hair gel with the label "flawless"
18. Topographical letters Contour made of a layered paper, muted pastel colors
19. a logo for the company "brainboost", where the letters look like a brain
20. a logo for the company "imagine", where the letters look like hands pointing up
21. A vintage postage stamp showing a painting of the Golden Gate Bridge and the text "California".
22. a plate of spicy food with the word "spicy" written in flowing cursive
23. a gold and black logo for the company "moneymoneymoney", which looks like dollar signs
24. A rendered 3D model of the word "Dependable" made out of granite.
25. a volcano erupting, with the text "magma" in red
26. a photo of a prison cell with a window and a view of the ocean, and the word "freedom" painted on the glass
27. a bowl of alphabet cereal, with the message "smackeroo" written in the bowl with the cereal letters
28. Studio shot of book shelf in the shape of letters READ, museum quality, white background.
29. Studio shot of sculpture of text "cheese" made from cheese, with cheese frame.
30. a landscape of the coyote point national wildlife refuge in arizona, with a coyote sitting on a rock, with the word "coyote" written in sunrise colors
31. A professional logo for the crypto trading platform "Salt-Mine".
32. The word "exquisite" written in modern calligraphy.

33. A bowl of tomato soup with pasta letters that read "Delicious".
34. intricate and highly detailed white paper cut out art of a word "SNOW", a storybook illustration, paper cut out, standing in a grotto, made out of white paper, loss of inner self, opening door, hides in the shadows of trees, lithograph, a painting of white silver
35. 3-d letters "dessert" made from desserts, arranged on a plate, studio shot
36. studio shot of word "BEE" made from bees, white background, in a frame made from bees
37. The logo for Robotrax, with metallic letters arranged in the shape of a robot.
38. chunky, organic, colorful, letters "fuzzy" made from many furry spheres of different sizes, 3-d rendering, centered in the frame
39. photo of a dark cave with the word "crazy" carved into the wall, with a yellow light shining through the cave entrance
40. a pair of scissors pointing down, and a computer with the word "delete" on the screen
41. studio shot, word "wow" in script made from rainbow colored fur, in a furry frame, white background, centered
42. Word "broken" made from broken shattered black glass, centered.
43. a black and white photo of a saxophone with the word "jazz" written in flowing cursive
44. Muted pastel multi colored paint swirled in white paint in the shape of letters "swirl", globular paint in liquid
45. a logo for the company "quantum", where the "q" looks like a lightning bolt
46. dslr shot of a pair of black and red sneakers with the word "punk" written in white. the background is a dark blue
47. a logo for the company "diamonds", with a diamond in the shape of a heart
48. a logo for the company "birthdaypix", where the letters look like birthday candles
49. a fork with the word "salad" engraved on it in a calligraphic font
4:00" taped to a fridge.
26. A large recipe book titled "Recipes from Peru".
27. marquee billboard with "my fear of moving stairs is escalating"
28. shadow of a stone, taken from the point of view of an ant, with the caption "look at that shadow!"
29. a pumpkin with a mustache and a monocle and a top hat, with the text "you can get rich too" in a speech bubble
30. a cartoon of a dog holding a telescope looking at a star with a speech bubble saying "i wonder if there's a dog on that planet"
31. a blueprint of a house, with a triangle for the roof, a square for the walls, and a rectangle for the floor, and with the message "this house is built on the principles of abstraction"
32. a sunflower field with a tractor about to run over a sunflower, with the caption "after the sunflowers they will come for you"
33. text "balloons are flying" made from rainbow balloons, pastel background
34. the hubble telescope and the milky way, with the text "the universe is a mystery, but we are here to solve it"
35. a heart with the text "i love you", with the letters "love" made of rainbow colors
36. studio shot of beautiful textbook with title "how to be a manager of managers", white background
37. A decorative greeting card that reads "Congratulations on achieving state of the art!"
38. a painting of a cornfield with the words "feed the nation" in simple letters and colors
39. A sign that says "Please refrain from arguing with the chimpanzees".
40. a cartoon of a turtle with a thought bubble over its head with the words "what if there was no such thing as a thought bubble?"
41. "Fall is here" written in autumn leaves floating on a lake.
42. a crab sitting on a beach with a surfboard, the sun is a giant orange, and the sky is a rainbow, and the crab is thinking "you are all that matters"
43. the city of toronto as seen from an airplane, with a giant cn tower in the middle of the frame, with the text "the cn tower" in comic sans
44. a cartoon of a hippo with a speech bubble saying "i'm a hippo, what do you want?"
45. a lobster in a suit and tie, holding a microphone, with the caption "lobster says what?"
46. book with "surgery made easy"
47. art installation of a chair with the text "i got nothin" carved into the backrest
48. a painting of a landscape, with a handwritten note that says "this painting was not painted by me"
49. a picture of a bruised apple with the text "apples are good for you" in a fancy font
50. A photo of a corgi with a sign that says "I am not a real corgi".
51. Words "It takes AI and rain to make a rainbow" black background, holography, ((neon colors)), colorful swirly magical ripples, bruh moment, intricate white and gold neon, 3d cg, photorelistic.
52. a black and white logo on words "Every artist was first an amateur." a white background, a wireframe diagram, generative art, branches growing as hair, tropical reef, trademarks and symbols, in a forest, ios icon, composed of random limbs, stone carving, done in the style of matisse, realms, terminals
53. picture of two hands, one holding a heart, the other holding a lightning bolt, with the text "love is power"
54. beautiful photo of the alps, with the caption "the best mountains could do"
55. a pencil sketch of a tree with the title "nothing to tree here"
56. a dark forest with a single light in the distance, and the text "i've come to talk with you again"
57. a circle with the text "infinity makes me happy", in a font that looks like it was written by hand
58. studio shot of vines in the shape of text 'knowledge is power' sprouting, centered
59. a photo of a beautiful field of poppies with a sign that says "no photos please"
60. a grumpy sunflower with a "no solar panels" sign
61. A meme showing a cat attacking a shoe, with the message "I own your sole".
62. a test tube with a drop of liquid in it, with the text "we've found water on mars!"
63. a scene with a city in the background, and a single cloud in the foreground, with the text "contemplate the clouds" in rounded cursive
64. a picture of a dog and a cat with their heads poking out of a cage with a sign saying "no pets allowed"
65. a 3d model of a 1980s-style computer with the text "my old habit" on the screen
66. a mouse with a flashlight saying "i'm afraid of the dark"
67. A photo of a rabbit sipping coffee and reading a book. The book title "The Adventures of Peter Rabbit" is visible.
68. clown is holding a paper sign with "Even in hard times there's a possibility to have fun."
69. newspaper with the headline "aliens found in space" and the text "the truth about everything now challenged"
70. a dog with a speech bubble with the text "woof woof" and a translation speech bubble with the text "other dogs do vex us"
71. robot on a butter food processing line, with robot looking dejected, with an overhead red light indicating error, with robot saying "i can't believe it's not butter"
72. a graffiti art of the text "free the pink" on a wall
73. a lizard sitting on a baseball field home plate, with the words "made it safe" in a speech bubble
74. a picture of multiple trees at various stages of development, with the caption "growth is a continuous process"
75. a purple flower with a crown on its head and a speech bubble that says "i am the purple flower!"
76. a 1950s-style robot with a giant head and a body shaped like a rocket, with the caption "wow, a real spaceman!"
77. A professionally designed logo for a bakery called Just What I Kneaded.
78. Minimal sculpture of word "this is the future" made from light metallic iridescent chrome thin wire, 3-D render, isometric perspective, ultra-detailed, dark background.
79. pillow in the shape of words "ready for the weekend", letterism, funny jumbled letters, [ closeup ]!!, breads, author unknown, flat art, swedish, diaper-shaped, 2000, white clay, surreal object photography
80. plant in a fancy pot with a "do not touch" sign on it
81. a picture of the earth with the words "save the earth" in a circle
82. scholarly elephant reading a newspaper with the headline "elephants take over the world"
83. photo of a sign with "having a dog named shark at the beach was a mistake"
84. photo illustration of the earth being struck by multiple lightning strikes that merge, with the caption "astonishment at the speed of light"
85. a photo of a fish tank with a fish inside, with the text "tank you for visiting!"
86. the words "Art is never finished, only continued" in paint splatters on a white background, graffiti art, edge of nothingness love, muddy colors, colored woodcut, beautiful, spectral color
87. photo of a restaurant "the gas station"
88. A t-shirt with the message "There is no planet B" written on it.
89. a close up of a figurine of toothpaste tube, a 3D render, candy pastel, with text "brush your teeth" on the tube
90. A hand-drawn blueprint for a time machine, with the caption "Time Traveling Device".