Text Rendering Strategies for Pixel Language Models



Introduction
There is a growing movement in NLP towards tokenization-free methods (Clark et al., 2022; Xue et al., 2022; Yu et al., 2023), including pixel-based representations of text (Salesky et al., 2021, 2023; Rust et al., 2023; Tschannen et al., 2023). It has been shown that these tokenization-free methods can readily handle unseen languages and that they are more robust to noise attacks than tokenization-based models. In addition, pixel-based approaches can effectively exploit visual similarities between characters and scripts because they allow for complete parameter sharing across all inputs, making them a promising direction for multilingual NLP.
Previous work on pixel-based models segments the rendered text either into consecutive patches (Rust et al., 2023; Tschannen et al., 2023) or with a sliding window (Salesky et al., 2021, 2023), as in speech processing. Although the proposed approaches have the appealing properties of yielding compact and transferable representations, they also result in a very large input space because there is no unique way to represent lexical units. As a consequence, pixel-based models could observe a new set of image representations with every new sentence, which adds redundancy to the input space and is sub-optimal for developing contextual language representations. We refer to these unstructured rendering strategies as CONTINUOUS and illustrate the point qualitatively in Figure 1 and Figure 2, and quantitatively in Figure 3. In this work, we ask whether structuring the input, which leads to more frequent parameter updates through now-unique word representations, would enable pixel-based models to develop a deeper understanding of context and semantics. We then propose rendering strategies structured around providing the model with a compressed input space.
We demonstrate how enforcing a BIGRAMS-structured rendering strategy leads to both a more capable and more data-efficient model: when evaluated on semantic sentence-level tasks, we find that a 22M-parameter model performs competitively with the unstructured original at 86M parameters, and that scaling back up to 86M parameters narrows the performance gap to BERT (Devlin et al., 2019) trained on the same data. In subsequent analyses, we find that the added input structure provokes a clear visual token frequency bias in the learned embedding space. While also found in BERT, frequency biases have been shown to degrade the quality of embedding spaces when word representations are determined not only by semantic relations but also by the number of model updates (Gong et al., 2018; Gao et al., 2019; Fuster Baggetto and Fresno, 2022). We show that frequent words have more context-specific representations than infrequent words, especially in the upper layers. Finally, we show that PIXEL models acquire a non-trivial semantic understanding during pretraining, but that their sentence representations are easily influenced by this frequency bias. We release all models (https://huggingface.co/Team-PIXEL) and code (https://github.com/xplip/pixel/tree/TextRenderingStrategies) for pretraining and finetuning.

Background: modelling text as images
We build upon the general-purpose language encoder framework presented in Rust et al. (2023): PIXEL is a text autoencoder which builds on the Masked Autoencoding Vision Transformer (ViT-MAE; He et al., 2021) and is similarly pretrained with a masked reconstruction objective. However, instead of patches from natural images of objects (Deng et al., 2009), the patches now contain images of text. To go from text to images of text, PIXEL relies on a rendering library (PangoCairo; https://docs.gtk.org/PangoCairo) to produce a sequence-level image which is sliced into image patches of size 16 × 16 pixels. The maximum sequence length of 529 patches approximately matches the memory requirements of BERT, the closest benchmark for PIXEL. By using the Google Noto font family, which supports the majority of Unicode codepoints, the renderer supports all languages that can currently be typeset.
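As an illustration, the slicing step can be sketched with NumPy alone (a minimal sketch assuming a pre-rendered grayscale image exactly one patch tall; the actual PIXEL pipeline renders with PangoCairo and handles fonts, colour channels, and sequence packing):

```python
import numpy as np

def slice_into_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Slice a rendered text image of shape (patch_size, W) into a sequence
    of flattened patch_size x patch_size patches, reading left to right."""
    h, w = image.shape
    assert h == patch_size and w % patch_size == 0, "image must be one patch tall"
    n = w // patch_size
    # (patch_size, n, patch_size) -> (n, patch_size, patch_size)
    patches = image.reshape(patch_size, n, patch_size).transpose(1, 0, 2)
    return patches.reshape(n, patch_size * patch_size)
```

Each row of the output then plays the role of one visual "token" fed to the patch embedding layer.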
Before the first layer of the PIXEL model, image patches are linearly projected to obtain a sequence of patch 'embeddings'. During pretraining, 25% of embeddings are masked in spans of up to 6 patches, and only the unmasked patches, with a prepended CLS embedding, are passed through the encoder. After reinserting the masked embeddings among the encoder outputs, relying on fixed sinusoidal position embeddings for ordering information, the decoder predicts the pixel values of only the masked patches. To later finetune the encoder on a classification task, the decoder can be replaced with a task-specific head and the masking ratio set to 0%.
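The span-masking step can be sketched as follows (an illustrative approximation; the exact span-length distribution and sampling procedure in PIXEL may differ):

```python
import random

def sample_span_mask(seq_len, mask_ratio=0.25, max_span=6, seed=0):
    """Randomly mask roughly mask_ratio of patch positions by repeatedly
    drawing spans of 1..max_span consecutive patches."""
    rng = random.Random(seed)
    mask = [False] * seq_len
    target = int(seq_len * mask_ratio)
    while sum(mask) < target:
        span = rng.randint(1, max_span)
        start = rng.randint(0, seq_len - span)
        for i in range(start, start + span):
            mask[i] = True
    return mask
```

Only the positions where the mask is `False` would then be passed through the encoder, and the decoder is trained to reconstruct the pixels at the `True` positions.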

Structured rendering
Previously proposed approaches to rendering text as images render full sequences of text and segment them either into consecutive patches (Rust et al., 2023; Tschannen et al., 2023) or with a sliding window (Salesky et al., 2021, 2023). These CONTINUOUS strategies result in a significant number of uniquely-valued patches, many of which may be observed only once during training. We depict this redundancy in Figure 2 and quantify it in Figure 3, showing how similar text inputs result in unique visual representations.
We compare four rendering strategies: the original unstructured (CONTINUOUS) and three structured ones (WORDS, MONO, BIGRAMS), as depicted in Figure 1. To render WORDS, we separate segments with additional whitespace such that each new segment begins at the start of the next image patch, regulating possible spatial variation. BIGRAMS, rendering two characters per image patch, is chosen to be widely applicable without knowledge of word or morphemic segmentation (Mielke et al., 2021; Keren et al., 2022). More specifically, consider the word pairs ⟨"grow", "growing"⟩ and ⟨"growing", "walking"⟩: the BIGRAMS renderer will produce an overlap of image patches (underlined) for both pairs, while the same extent of overlap is not guaranteed with WORDS-level rendering, as it is regulated by character width. The choice of character (n = 2)-grams is motivated by what generally fits within a 16 × 16 pixels image patch in the setup from Rust et al. (2023). MONO instead applies monospaced fonts where each character has a fixed width; depending on font size, this may result in character bigram patches without breaks within characters, but this is not guaranteed. The main difference between BIGRAMS and MONO is that MONO simply slides across the sentence, two characters at a time, yielding two ways to represent a word, whereas BIGRAMS renders the words and then pads with whitespace, ensuring unique inputs. As seen in Figure 3, the structured rendering strategies result in a greatly compressed input space as measured by the number of unique image patches processed by the model, but Figure 1 reveals that this comes at the cost of longer sequence lengths. While the rendering strategies we propose were not specifically designed for English, they may not all equally generalise to other languages or scripts. We further discuss the representational efficiencies of these strategies in § A.1 and limitations to generalisability under Limitations.
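The patch segmentation that BIGRAMS induces can be approximated in a few lines (a sketch that works on strings rather than rendered pixels; the real renderer operates on glyphs, so character widths and scripts add complications not modelled here):

```python
def bigram_segments(sentence: str) -> list[str]:
    """Split each whitespace-delimited word into character bigrams,
    padding odd-length words with a trailing space so that every word
    starts at a fresh patch boundary."""
    patches = []
    for word in sentence.split():
        if len(word) % 2:
            word += " "  # pad so the word fills whole patches
        patches.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return patches
```

For the pair ⟨"grow", "growing"⟩, both words yield the patches "gr" and "ow", illustrating the guaranteed overlap discussed above.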

Model scale variants
Recall from Figure 3 that CONTINUOUS rendering produces a significantly larger set of unique image patches compared to the other approaches. A consequence of this is that models must learn to encode many almost-identical visual representations, which may be wasteful both in terms of parameters and training efficiency. Therefore, we hypothesise that PIXEL models that operate over fewer unique image patches can be scaled down without sacrificing performance. While "Base" models and larger ones are widely used for their strong performance, proven scaling laws (Touvron et al., 2021; Zhai et al., 2021) enable greater experimentation and model development at smaller scales (Ivgi et al., 2022), which is both more environmentally friendly (Strubell et al., 2019; Bender et al., 2021; Hershcovich et al., 2022) and facilitates contributions with limited computational resources. With this in mind, we propose two smaller architectures which we compare across downstream tasks in § 5. Our BASE model architecture is directly adopted from ViT (Dosovitskiy et al., 2021) and PIXEL, and we add two more compact SMALL and TINY model variants, as described in Table 1. The configurations of the smaller models are based on the ViT variants presented in Zhai et al. (2021). Following the scaling experiments in He et al. (2021), which indicate that shallow decoders of as few as 2 layers can be sufficient for ViT-MAEs, we halve the number of decoder layers at every scale reduction.
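A configuration sketch of the three scales, following the standard ViT-Tiny/Small/Base dimensions from Zhai et al. (2021) and the decoder-halving scheme described above. Note that the concrete hidden sizes, head counts, and decoder depths below are illustrative assumptions, not a verbatim copy of Table 1:

```python
# Hypothetical scale variants; exact values in the paper's Table 1 may differ.
SCALE_VARIANTS = {
    "TINY":  dict(hidden=192, enc_layers=12, heads=3,  dec_layers=2),
    "SMALL": dict(hidden=384, enc_layers=12, heads=6,  dec_layers=4),
    "BASE":  dict(hidden=768, enc_layers=12, heads=12, dec_layers=8),
}
```

The encoder depth stays fixed while width shrinks, and each step down the scale halves the reconstruction decoder.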

Experiments
We pretrain SMALL models with the proposed rendering strategies. The models are then evaluated on dependency parsing (UDP) with data from Universal Dependencies v2.10 treebanks (Zeman et al., 2022; Nivre et al., 2020) and on GLUE (Wang et al., 2018), exploring the models' capabilities at syntactic processing on the word level and semantic processing on the sentence level.

Pretraining
We pretrain all models on the English Wikipedia and Bookcorpus (Zhu et al., 2015) data used by Rust et al. (2023) for direct comparison with PIXEL and BERT, which results in ∼16.8M training examples. We follow the suggested hyperparameters used for PIXEL with the exception of batch size. The smaller architectures of SMALL and TINY allow for larger batch sizes, which we double from 256 examples to 512 and 1024, respectively. We then halve the number of pretraining steps accordingly, from 1M to 500k and 250k, in order to train for the same number of epochs as PIXEL (∼16 epochs, varying slightly due to differing sequence lengths per rendering strategy).
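The batch-size/step-count trade-off amounts to keeping the total number of training examples (steps × batch size) constant; a quick sanity check of the arithmetic above:

```python
def matched_steps(base_steps: int, base_batch: int, new_batch: int) -> int:
    """Scale down the number of training steps so that the total number of
    examples seen (steps x batch size) stays constant as batch size grows."""
    assert (base_steps * base_batch) % new_batch == 0
    return base_steps * base_batch // new_batch
```

With a 1M-step, batch-256 baseline, doubling the batch to 512 gives 500k steps, and 1024 gives 250k, as stated above.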
Pretraining BASE takes 8 days on 8 × 40GB Nvidia A100 GPUs, while in comparison, pretraining SMALL takes less than 48 hours on the same hardware, and TINY less than 24 hours. Loss trajectories for the different rendering strategies are in line with their representational efficiency (Figure 3), indicating that structured rendering may make the masked reconstruction task more data-efficient, achieving a low loss in fewer steps (see § A.2: Figure 10).

Finetuning
To finetune our models for classification tasks, we replace the decoder used for pretraining with a task-specific classification head. We do not search for more optimal hyperparameters than those used for PIXEL, with the exception of the learning rate; we find that the more compact architectures often benefit from a slightly higher learning rate. We follow the same protocol during finetuning as done for PIXEL: for word-level tasks we obtain the rendered image patch indices for every word, and as a consequence, the CONTINUOUS strategy becomes identical to the WORDS structure when finetuning on UDP. § 6.1 further investigates the consequence of a mismatch between how the data is structured during pretraining and finetuning. When finetuning on GLUE, the structure follows what was seen during pretraining for all rendering strategies. Reported performances for BERT and PIXEL are taken from Rust et al. (2023).

Rendering strategies
We present averaged results comparing the rendering strategies in the left part of Table 2. Detailed results for each downstream task are presented in Table 4 and Table 5 in the appendix. For UDP, we find that the WORDS structure slightly outperforms BIGRAMS and MONO on this word-level task. Comparing the WORDS and CONTINUOUS strategies gives a first hint of the importance of including structure during pretraining as well, keeping in mind that the rendering structure is the same for both strategies when finetuning on UDP. For GLUE, we see a large increase in performance when rendering with any structure, and especially with BIGRAMS. We attribute the difference in performance between BIGRAMS and MONO to the unique word representations produced by BIGRAMS, as discussed in § 3.
We find that BIGRAMS is the best-performing structure on average, even slightly outperforming the 86M-parameter PIXEL (average UDP: 76.1; average GLUE: 74.1) with only a quarter of its parameters. We provide an investigation into the mechanisms that enable this improved performance on GLUE in § 6.4. Next, we pretrain TINY and BASE model variants with BIGRAMS rendering to evaluate performance at different model scales.

Model scaling
The right part of Table 2 compares the different model scales, all following a BIGRAMS rendering strategy. Detailed results are likewise presented in Table 4, Table 5, and Table 6 in the appendix. We find that the TINY configuration performs competitively on the word-level tasks considering its only 5.5M parameters, but shows a larger gap to SMALL and BASE on the sentence-level GLUE tasks. SMALL proves to be a good trade-off between scale and performance: it is not far behind BASE on GLUE and even slightly outperforms it on UDP. BASE comes a step closer to closing the performance gap to BERT on GLUE. Comparing to the performance following a CONTINUOUS rendering strategy, summarised as the difference in average performance (∆µ), it is clear that the more compact the model size, the greater the benefit from structured rendering.
To verify that BIGRAMS rendering does not degrade performance on multilingual sentence-level tasks across different scripts and morphologies, we also include results on TyDiQA-GoldP (Clark et al., 2020). Again, we find that SMALL performs competitively considering its size.

Ablations and supplementary analyses
In this section we investigate how BIGRAMS rendering changes the model compared to CONTINUOUS. For clarity in what follows, we refer to the BASE model with BIGRAMS rendering from § 5.4 as BASE-BIGRAMS and keep referring to the original model from Rust et al. (2023) as PIXEL.

When does rendering structure matter?
Having established that a structured rendering strategy leads to improved downstream performance, we further investigate when it is needed: is it sufficient to finetune with structure, or does the model develop strategy-specific features during pretraining? We analyse this by comparing rendering strategies between pretraining and finetuning.
The results in Table 3 for GLUE show that a mismatch leads to lower downstream performance for both strategies, with BIGRAMS → CONTINUOUS being the most harmful, perhaps unsurprisingly. This result does not align with the finding for UDP in § 5.3, where CONTINUOUS overcomes the change to WORDS-structured rendering. It may indicate that the lower-level UDP tasks are easier for PIXEL-based models than the high-level GLUE tasks (Lauscher et al., 2020). This is in line with the relatively good performance of TINY-BIGRAMS on UDP.
To emphasise the increase in performance on semantic tasks with BIGRAMS rendering, we refer to Table 7 in the appendix. We next turn our attention to how BIGRAMS rendering enables better performance on semantic tasks.

Footnote 8 (§ 5.4): We expect that BASE could prevail and would benefit from a wider search for optimal hyperparameters during finetuning.

Footnote 9 (§ 5.4): With the CONTINUOUS rendering strategy, answer spans are extracted such that the answer may include leading or trailing characters when there is no exact mapping from a word to an image patch index. Therefore, we did not include TyDiQA-GoldP in the comparison in § 5.3. More details can be found in Rust et al. (2023).

Contextual representations
The extent to which language models capture semantic information is partly determined by their ability to contextualise text (Peters et al., 2018).We therefore analyse how capable BASE-BIGRAMS is at producing contextualised word representations.
We use the Words in Context dataset (WiC; Pilehvar and Camacho-Collados, 2019) of sentences that contain target words (nouns or verbs) in either a similar (True) or different (False) context across sentence pairs. We compute the mean hidden state output over all tokens associated with the target word to obtain a representation. We infer that there is contextualisation if the model generates representations of a target word from different contexts with a low cosine similarity compared to target words in similar contexts. We report this indication of contextuality for each layer of the model, including the input layer, to better understand the properties of the different layers. Similarities between randomly chosen words from random examples (Random) are included as a baseline. Figure 4a plots the resulting distributions of similarities. We see that representations of target words from similar contexts have a higher cosine similarity than those from different contexts, though with considerable overlap, and a higher similarity for different contexts than for random words. When comparing to BERT in Figure 4b, there is a clear difference in the similarity compared to random words. The difference in similarity between similar and random words gradually increases throughout the BASE-BIGRAMS model until the final layers, whereas the difference steadily decreases throughout the model for BERT.
Given the shared image patch embedding layer in PIXEL-based models, random words are more similar to each other at the input layer when modelled as images than as entries in a vocabulary. Taken together, these plots suggest that a PIXEL-based language model is capable of forming contextualised word representations and that these are more context-specific in upper layers, though not as fine-grained as seen for BERT.
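The per-pair similarity measure used above can be sketched as follows (a minimal NumPy sketch; `hidden_states_*` stand in for one layer's outputs from the model, and the index lists for the patches covered by the target word, both assumptions about the interface rather than the paper's actual code):

```python
import numpy as np

def target_word_similarity(hidden_states_a, hidden_states_b, idx_a, idx_b):
    """Cosine similarity between mean-pooled hidden states of a target word
    in two sentences. hidden_states_* have shape (seq_len, hidden_dim);
    idx_* are the patch indices covered by the target word."""
    a = np.asarray(hidden_states_a)[idx_a].mean(axis=0)
    b = np.asarray(hidden_states_b)[idx_b].mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Repeating this per layer over WiC's True and False pairs, plus random word pairs, yields the distributions in Figure 4.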

Token frequency and similarity
The degree of cosine similarity between random words observed in Figure 4a encourages us to assess the isotropy of the model (Ethayarajh, 2019; Rajaee and Pilehvar, 2021). The high cosine similarities suggest that the word representations are not evenly distributed with respect to direction in the embedding space, but instead appear to be anisotropic. When learned vector representations populate a narrow cone in the embedding space, this geometric alignment leads to an overestimation of their similarity (Gao et al., 2019), which is not an expected property of an expressive word embedding space (Arora et al., 2016; Mu and Viswanath, 2018). Recent work has shown that Transformer-based language models can develop a representation bias driven by token frequency, where low-frequency tokens are clustered together in the embedding space, leading to anisotropy in the model (Gao et al., 2019; Fuster Baggetto and Fresno, 2022; Jiang et al., 2022). This bias leads to poor word contextualisation because the learned vector positions of low-frequency words have not moved far from their random initialisation. Thus, their embeddings are not sufficiently distinct from unrelated words with similarly low token frequency (Gong et al., 2018; Cai et al., 2021). Tokens with a higher frequency, and thus more parameter updates, can move further in the embedding space from their initialisation and become more semantically meaningful. Consequently, we hypothesise that compressing the input space in the form of structured rendering allows the model to build more contextualised word representations through more frequent parameter updates.
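A common way to quantify this, following Ethayarajh (2019), is the mean pairwise cosine similarity between embeddings of unrelated words; the sketch below is one such estimator (an illustration of the measure, not the paper's exact analysis script):

```python
import numpy as np

def anisotropy(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity between rows of an (n, d) embedding
    matrix, excluding self-similarity. Values near 0 suggest directions are
    spread out (isotropic); values near 1 suggest a narrow cone."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(off_diag.mean())
```

Applied to embeddings of random, unrelated words, a high value indicates the narrow-cone geometry discussed above.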
We investigate this by sampling inputs that were seen during pretraining with high and low frequency. Specifically, we take the 100 most frequently occurring words from the Wikipedia corpus seen during pretraining, and 100 words that occur around 1,000 times (rank ≈ 50k), excluding punctuation and numbers. We first render each word from the two frequency samples in isolation. We then include a comparison to words in context across 100 unique sentences per word with BASE-BIGRAMS (recall from § 6.2 that the CONTINUOUS rendering strategy by design makes an exact mapping from words in a sentence to image patch indices unattainable). We plot the distributions of cosine similarities between representations from the last encoder layer, where we expect embeddings from both models to be contextualised. Comparing the plots from the two rendering strategies, summarised in Figure 5, the effect of pretraining with a smaller set of unique tokens becomes clear: for PIXEL, the distribution appears as a mixture with a larger distribution mass at higher values of cosine similarity when comparing high-frequency words to other high-frequency words (excluding self-similarity for now) than when comparing low-frequency words to other low-frequency words. For BASE-BIGRAMS, the frequent words, both in isolation and in context, are less directionally aligned with each other compared to the infrequent ones, which is in line with the representation degeneration problem from Gao et al. (2019) and more frequent updates leading to better contextualisation. Figure 6 visualises the in-context representations in 2 dimensions using t-SNE (van der Maaten and Hinton, 2008) and provides an additional indication that more frequent words have less locally compact representations (plotting the first 2 singular values from a singular value decomposition gives the same qualitative indication). We expect that in-context representations from PIXEL would also qualitatively resemble Figure 5a, but cannot easily demonstrate this due to the aforementioned challenges in aligning patch embeddings with CONTINUOUS rendering.

Frequency bias and semantic modelling
While there is less evidence of representation degeneration with CONTINUOUS rendering, it is likely that the poorer performance on GLUE in § 5.4 is caused by PIXEL seeing too many different patches too few times. This is a direct consequence of the multitude of ways that similar inputs can be rendered by the CONTINUOUS approach. However, the drop in performance when mismatching the rendering strategies in § 6.1 for CONTINUOUS → BIGRAMS demonstrates that the model has developed a set of strategy-specific expectations and features that are not easily updated. In fact, the new rendering strategy for finetuning introduces a set of patches that likely never escape the low-frequency domain and therefore remain poorly contextualised. Signs of a token frequency bias have also been found in BERT (Fuster Baggetto and Fresno, 2022). We lastly assess the connection between visual token frequency and downstream semantic performance. In BERT, high-frequency words have the most context-specific representations (Ethayarajh, 2019), and upper-layer representations of low-frequency words are influenced more by their context than those of frequent words (Voita et al., 2019). Following Ethayarajh (2019), we see that this applies to BASE-BIGRAMS as well (illustrated in Figure 7 and discussed in greater detail in § A.5). We expect that sentences varying only in casing would result in different representations, since the lowercase form appears more frequently for most words. This demonstrates the impact of observed token frequency on semantic modelling and is in line with observed biases in BERT's embedding space (Jiang et al., 2022). We rely on the Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017), also part of GLUE, for this assessment. We measure the cosine similarity between sentence representations (the mean hidden state output across all tokens in a sentence, excluding the CLS token and the black end-of-sequence token) and plot its correlation with the gold-standard similarity scores as the measure of performance. Figure 8 shows that both CONTINUOUS and BIGRAMS rendering during pretraining lead to non-trivial semantic modelling capabilities. At peak performance, around the middle layers, the increase from simply ensuring that all words are uncased is roughly the same as the increase from PIXEL to BASE-BIGRAMS. This resembles how frequent and infrequent tokens have unequal influence on their context in BERT (Voita et al., 2019).
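The layer-wise STS-B evaluation amounts to correlating cosine similarities of mean-pooled sentence representations with gold scores; a minimal sketch (the input arrays stand in for one layer's pooled representations, an assumed interface rather than the paper's actual evaluation code):

```python
import numpy as np

def stsb_layer_score(reps_a, reps_b, gold):
    """Pearson correlation between the cosine similarity of paired sentence
    representations (shape (n_pairs, hidden_dim)) and gold STS-B scores."""
    a = np.asarray(reps_a, dtype=float)
    b = np.asarray(reps_b, dtype=float)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    cos = (a * b).sum(axis=1)
    return float(np.corrcoef(cos, np.asarray(gold, dtype=float))[0, 1])
```

Computing this score per layer yields the curves plotted in Figure 8.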
Seeing that BASE-BIGRAMS exhibits similar representational traits to BERT, future work could aim for more semantically capable PIXEL-based models by generalising advances found for tokenizer-based models (Gao et al., 2021).

Related work
Recent work on pixel-based language modelling has demonstrated how visual language understanding can be achieved through pixels only (Lee et al., 2022), observed that the visual similarity of languages plays an important role in cross-lingual transfer (Rahman et al., 2023), and shown how unifying the modalities for text and images allows a single encoder to perform multimodal tasks (Tschannen et al., 2023). By relying on bytes directly, the unification of modalities can be taken even further (Jaegle et al., 2021; Horton et al., 2023; Yu et al., 2023). The work most closely related to ours, after Rust et al. (2023), is the work on machine translation with pixel representations (Salesky et al., 2021, 2023). A detailed discussion of previous pixel-based approaches can be found in Rust et al. (2023, § 5). Where PIXEL laid the foundation for general-purpose language encoding with pixel-based representations, this work takes the first step towards hypothesis-driven improvements without adding additional data (Yang et al., 2019) or scaling up the model (Conneau and Lample, 2019), though it is possible that competitive performance could be achieved by a model with CONTINUOUS rendering by pretraining on more data for more steps (Liu et al., 2019).
Our addition of BIGRAMS structure resembles the addition of optional but hugely beneficial (n = 4)-grams in the character-based CANINE model (Clark et al., 2022). While character-level n-gram models (Wieting et al., 2016; Bojanowski et al., 2017) have been succeeded by Transformer-based language models, character-level features remain valuable as they are less sparse and more robust to misspellings than word n-grams, and remain useful especially for morphologically rich languages (Garrette and Baldridge, 2013; Kulmizev et al., 2017). Previous work has hypothesised that character-level models would be more suitable than subword-based models for modelling morphologically rich languages (Tsarfaty et al., 2020; Keren et al., 2022), but a semantically capable design has proven non-obvious (Ma et al., 2020; Keren et al., 2022; Nzeyimana and Niyongabo Rubungo, 2022; Sun et al., 2023). We see potential for future work with pixel-based language models exploring appropriate strategies for learning morphological patterns (Klein and Tsarfaty, 2020; Seker and Tsarfaty, 2020; Soulos et al., 2021).

Conclusion
We evaluate four text rendering strategies to address the problem of redundancy in the input space of PIXEL-based language models. Compressing the input space yields more frequent parameter updates per unique input, which in turn leads to better contextualised language representations. We find that rendering two characters per image patch (BIGRAMS) is a good trade-off between efficiency and generalisability, resulting in substantial improvements on downstream semantic and sentence-level tasks, contributing to open-vocabulary NLP with limited computational resources.
Further analyses reveal how the added rendering structure provokes clear representational similarities to what has been found in BERT. We see potential in future work generalising improvements found for tokenization-based masked language models to PIXEL-based masked language models. Furthermore, considering that the Vision Transformer has also been applied to speech modelling (Huang et al., 2022), and that patch representation has been suggested to be a critical component for the success of ViTs (Trockman and Kolter, 2023), we see potential for image patches as the basis for unifying modalities.

Limitations
While the rendering strategies we propose here are well-suited to English, not all equally generalise to other languages or scripts. WORDS rendering relies on word boundaries, which may not be readily available or well-defined for languages which do not mark word or sentence boundaries with whitespace, such as Thai, or for polysynthetic languages such as Inuktitut. MONO and BIGRAMS are more general approaches, but may affect the rendering of positional characters such as diacritics or of correct contextual forms, depending on where boundaries are created. For both approaches, it may be necessary to modulate font size across languages to ensure character pairs fit into a single patch, especially when rendering with diacritics. MONO provides further representational efficiency compared to BIGRAMS by fixing character width, but comes at the cost of more limited language coverage; many scripts cannot be made fixed-width and fewer than 10 have monospaced fonts available. CONTINUOUS rendering provides a more general approach, which must be balanced against learning efficiency. As seen in Figure 1, structured rendering compresses the input space by reducing the positions in which characters may be observed. This dramatically affects the number of unique inputs observed in a fixed number of sequences, as quantified in Figure 3. Concretely, the 10 most frequently observed image patches after processing 100,000 sequences from English Wikipedia are shown in Figure 2; with continuous rendering, all are positional variants of the same subword, while with structured rendering, each represents a different word or morpheme. However, instituting word- or subword-level structure with whitespace padding increases sequence lengths compared to unstructured rendering, as quantified in Figure 9.

A Appendix

A.4 TyDiQa-GoldP
The CONTINUOUS rendering strategy used for PIXEL, in which words often overlap within an image patch, leads to extracted answer spans that potentially include leading or trailing characters that should not be part of the answer. BIGRAMS rendering addresses this issue by yielding clear word boundaries in the input representations. However, the BIGRAMS rendering strategy poses new challenges for extracting answer spans for TyDiQA-GoldP. While the task is simplified compared to the primary task by removing language tracks that lack whitespace, we find that a surprisingly high number of "words" are strings of comma-separated words or concatenations of characters that should be delimited by whitespace. By design, we consider and render these as one unit, since we only split by whitespace. An example of a single "unit" from the training split highlights this issue: "oikeudet[1]Lääni[1]1Vilna523,0501387Vilnan", where the expected answer is "Vilna". In such an instance, a PIXEL BIGRAMS model will predict the whole unit, resulting in lower performance. Furthermore, some of these "words" in the training data are more than a thousand characters long and therefore do not fit within the maximum sequence length of 529 patches.
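The failure mode can be reproduced directly with the example unit quoted above:

```python
# A problematic "unit" from the TyDiQA-GoldP training split (quoted above):
unit = "oikeudet[1]Lääni[1]1Vilna523,0501387Vilnan"

# Splitting on whitespace, as the BIGRAMS segmentation does, keeps the whole
# string as a single renderable word, even though the gold answer "Vilna"
# is buried inside it.
words = unit.split()
```

Since extraction operates on whole whitespace-delimited units, the smallest predictable span containing "Vilna" is the entire string, which lowers the exact-match score.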

Figure 1 :
Figure 1: Examples of rendering strategies for the sentence "I must be growing small again." from Carroll (1865). Black patches mark the end of a sequence, following Rust et al. (2023).
(a) Most frequent patches with CONTINUOUS rendering: (b) Most frequent patches with BIGRAMS rendering:

Figure 2 :
Figure 2: A continuous rendering strategy results in many uniquely-valued image patches for similar inputs, while structured rendering (here, BIGRAMS) regularises and compresses the potential input space.

Figure 3 :
Figure 3: Number of unique image patches observed as a function of training data sequences. Structured rendering results in greater representational efficiency.

Figure 4 :
Figure 4: Distributions of cosine similarities for verbs and nouns from the WiC dataset across model layers 0-12, layer 0 being the input layer. Every example presents a target word in either a similar or different context across a sentence pair. The representation of the target word is computed as the mean hidden state output over the corresponding tokens. We generally see that BASE-BIGRAMS encodes target words in a similar context as more similar. The median cosine similarity between random words from random sentences is shown as a baseline.

Figure 5 :
Figure 5: Distributions of cosine similarities within samples of high-frequency words (High), low-frequency words (Low), or between the two samples. Rendering with BIGRAMS structure leads to less directionally aligned vector representations of frequent words, which have seen more updates during pretraining, compared to infrequent words.

Figure 6 :
Figure 6: t-SNE plot of the output embeddings of high- and low-frequency words in context from BASE-BIGRAMS. Low-frequency words cluster tightly in this space.

Figure 7 :Figure 8 :
Figure 7: Self- and intra-sentence similarity from BASE-BIGRAMS. High-frequency words are the most context-specific; low-frequency words are influenced by their context.

Figure 9 :
Figure 9: Distributions of sequence lengths (in patches) resulting from different rendering strategies.

Table 1 :
Details of PIXEL model scale variants.

Table 2 :
Structure (left): averaged results for SMALL models comparing downstream performance on UDP and GLUE following the different rendering strategies. Scale (right): averaged results across model scales using the BIGRAMS rendering structure. ∆µ is the difference in average performance between BIGRAMS and CONTINUOUS rendering for a given model scale. BERT results are marked in grey to visually distinguish them from pixel-based models.
We discuss limitations to answer span extraction with BIGRAMS rendering in § A.4.